AES-LBBB: AES Mode for Lightweight and BBB-Secure Authenticated Encryption

In this paper, a new lightweight authenticated encryption scheme, AES-LBBB, is proposed. It is designed to provide backward compatibility with the advanced encryption standard (AES) as well as high security and low memory usage. The primary design goal, backward compatibility, is motivated by the fact that AES accelerators are now very common in devices in the field; we are interested in designing an efficient and highly secure mode of operation that makes the best use of those AES accelerators. Backward compatibility receives little attention in the NIST lightweight cryptography standardization process, in which only 3 out of 32 round-2 candidates are based on AES. Our mode, LBBB, is inspired by the design of ALE in the sense that the internal state size is the minimum, 2n bits, when using a block cipher of length n bits for the key and data. Unfortunately, there is no security proof of ALE, and forgery attacks have been found on ALE. In LBBB, we introduce an additional feed from the block cipher's output to the key state via a certain permutation λ, which enables us to prove beyond-birthday-bound (BBB) security. We then specify its AES instance, AES-LBBB, and evaluate its performance for (i) software implementation on a microcontroller with an AES coprocessor and (ii) hardware implementation for an application-specific integrated circuit (ASIC) to show that AES-LBBB performs better than the current state-of-the-art Remus-N2 with AES-128.


Introduction
Symmetric-key cryptography plays an important role in secure communications owing to its efficiency. In particular, block ciphers are useful primitives to build various symmetric-key schemes. Advanced encryption standard (AES) [Nat01] is arguably the most widely used block cipher worldwide.
A recent trend in the symmetric-key community is to design lightweight schemes that are suitable to implement in extremely resource-restricted environments. Although AES is also efficient for lightweight implementations, the community is seeking designs that are sharply optimized for such implementations. For example, NIST is currently organizing a lightweight cryptography standardization process (NIST LWC) [Nat18].
A main challenge in designing lightweight cryptography schemes is to ensure sufficient security while keeping the implementation lightweight.
Regarding implementation efficiency, "lightweight" can be interpreted in several different ways, e.g., small memory size in microcontroller implementations, small circuit size in hardware implementations, and low energy consumption in hardware implementations.
Among them, memory size is a common metric for lightweight schemes. Low memory means a smaller circuit area in hardware implementations because registers dominate the hardware cost. The benefit of a smaller memory size becomes even larger when side-channel attacks are a concern; common countermeasures such as masking multiply the memory usage because they duplicate the sensitive state into several shares.
Achieving low memory is also crucial in software implementation for resource-restricted microcontrollers. In fact, there is a line of work studying the low-memory implementation of cryptography [MM13]. In particular, there are several real-world use cases in which volatile memory (e.g., random access memory (RAM)) is crucial. First, embedded system engineers always have a motivation to minimize memory footprint to reduce the per-chip cost. Second, some platforms have a special secure RAM for storing sensitive data, which is usually limited in size. For example, Microchip's SAM L11 microcontroller has a 256-byte TrustRAM featuring address scrambling and instantaneous wiping [Mic18]. Third, RAM can be temporarily unavailable even in a resource-rich environment. For example, some boot loaders should run cryptography before the initialization of dynamic RAM (DRAM) when only a tiny cache memory is available.
Regarding security, there is a demand for improving the 64-bit security of the conventional AES-Galois/Counter Mode (GCM) and AES-Counter with CBC-MAC Mode (CCM). NIST LWC explicitly requires better security than AES [Nat01], and many candidates have 128-bit security or close to it, for example, PHOTON [IKMP20] and SKINNY-AEAD [BJK+19].
In the NIST LWC, as listed above, many new designs have been discussed because the goal is to determine a new standard. Among 32 round-2 algorithms in NIST's process, there are only 3 AES-based schemes: COMET [GJN19], mixFeed [CN19], and SAEAES [NMS+19].
On the other hand, AES-based schemes still have great practical value because the industry has already invested heavily in AES accelerators and coprocessors and will continue the investment for backward compatibility. AES New Instructions (AES-NI) on Intel processors is by far the most popular AES accelerator, available on millions of central processing units (CPUs) in the field [Gue10]. Many other CPUs, including ARM and RISC-V, have AES instructions [MNP+20]. Besides extending a CPU with new instructions for AES, having an independent AES coprocessor is another way to accelerate AES, which is common among low-end microcontrollers [Mic20,EHR19]. In particular, some systems use a coprocessor as a root of trust to maintain security even after a main processor becomes compromised, e.g., secure hardware extensions (SHE) for Autosar [Aut19].
There is a line of research on taking advantage of those AES accelerators, and there is another line of work on designing efficient cryptographic algorithms using the AES round function as a building block [JN16, BMR+14, WP14]. More recently, researchers have been studying efficient cryptography using AES coprocessors to achieve better performance in embedded devices. Elbaz-Vincent et al. proposed a mode of operation for an AES coprocessor [EHR19]; meanwhile, Unterstein et al. studied efficient leakage-resilient cryptography using an AES coprocessor as a building block [USS+20].
In this paper, we propose a new lightweight scheme by spotlighting the fact that "backward compatibility" with AES is as important as security and implementation efficiency for providing real-world usability. That is to say, considering that so many accelerators and coprocessors for AES have become widespread, we aim to design new lightweight and secure schemes that can utilize the AES execution ability of those devices, instead of replacing an encryption algorithm with a brand-new one.
implementation. Besides, the key schedule function (KSF) is applied to the output of the KSF in the previous block. This saves implementing $\mathrm{KSF}^{-1}$. (To use the same key in every block while keeping the key state minimal, after KSF(K) is computed in a certain block, $\mathrm{KSF}^{-1}$ needs to be computed to reproduce K for the next block.) Moreover, the round transformation is 4 rounds of AES, thus if AES-NI is available, ALE can be computed fast.
However, as explained above, the AES coprocessors cannot produce intermediate values after each round, thus ALE cannot be implemented fast with the AES coprocessors. In addition, the 128-bit security claimed by the designers of ALE is not supported by security proofs. In fact, several researchers found forgery attacks [KR19, WWH+13]. Nevertheless, the design of ALE has many attractive features, and obtaining provable security for ALE-like designs is an interesting challenge.
In summary, among existing schemes, Remus-N2 instantiated with AES would be the best choice for microcontrollers having AES coprocessors. Investigating a new mode with a smaller memory requirement than Remus-N2 is of great interest. An ALE-like structure with provable security is also an interesting direction. Those features are summarized in Table 1.
To make the state of the art clear, we also list in Table 1 several features of GCM, OCB3, COFB, and SAEB, which only offer birthday-bound security. GCM and OCB3 enable parallel implementations, thus they can be fast but require a relatively large state. COFB and SAEB are sequential modes, thus they do not require a large state. SAEB is the smallest, but it does not achieve rate 1, while COFB does. Note that the security of GCM, OCB3, COFB, and SAEB has been proved in the standard model, which could not be achieved by ALE and Remus-N2.

Our Contributions
In this paper, we propose a new mode of operation for authenticated encryption that is efficiently implemented on devices having an AES coprocessor and offers almost 128 bits of provable security. We call our mode LBBB and its particular instance for AES AES-LBBB.

LBBB. AES-128 uses a 128-bit data state and a 128-bit key state, thus we need at least a 256-bit state to implement it. To be optimized for lightweight implementations, LBBB is designed so that the state size is minimal. LBBB, shown in Fig. 2, is a generic framework to construct authenticated encryption schemes with the minimum state size, which is 2n bits when using a block cipher of length n bits for the key and the data states. The 2n-bit internal state of LBBB consists of an n-bit key state and an n-bit data state. The key state is always secret, whereas the data state is secret during the hash function computation but becomes public when encrypting plaintext blocks. LBBB is designed by extending the idea of ALE so that it achieves almost n-bit security. To achieve this security level for online queries (or data complexity), the size of the state updated in every block must be 2n bits to avoid the birthday attack on the internal state. As in ALE, the key state of LBBB is updated by an update-permutation π. Additionally, the key state is updated by the block cipher output via a permutation λ, which is the difference from ALE. This additional procedure ensures that a collision of the internal state requires a query complexity of at least $2^n$, because for two data block sequences, after distinct data blocks are processed, the difference is propagated to the whole internal state of 2n bits. In addition to these two permutations, LBBB uses a permutation η for domain separation, which distinguishes whether the last data block is full or not. At the mode level, the exact specifications of π, λ, and η are left open.
For efficiency, LBBB is highly efficient, similarly to the Remus-N2 mode, in the sense that both achieve rate 1 (a block cipher is performed once for each n-bit plaintext block). On the other hand, LBBB has the minimum state size, which is its advantage over Remus-N2.

Figure 2: Encryption of LBBB. N, K, A, and M are a nonce, a secret key, associated data, and a plaintext. id(|A|, |M|) is a constant to distinguish whether A = ε or not and whether M = ε or not. Associated data blocks are processed in pairs, $M_i$/$C_i$ is a plaintext/ciphertext block, and T is a tag. π, λ, and η are permutations.
For security, we prove that LBBB achieves $(n - \log_2 n)$-bit security. Hence, using a block cipher with n = 128, the resultant scheme achieves 121-bit security. In our proof, π, λ, and η are assumed to be chosen from a wide class of permutations with properties like those of a universal hash function, and such permutations can be realized using lightweight operations. Example permutations are powering-up-based schemes [Rog04] that use a multiplication by 2 over GF($2^n$). Note that in the security proof, the underlying block cipher is assumed to be ideal, but even if a practical block cipher is used, the construction is secure as long as the cipher resists related-key attacks. Also note that to achieve almost n-bit security and have the minimum state size simultaneously, the key state must be updated as well as the data state because of the birthday attack on the internal state. This structural condition requires a proof in the ideal model.

AES-LBBB.
For a concrete AES-based scheme called AES-LBBB, an instance of λ and π can be chosen so that it can be further optimized for low-memory software implementations or low-area hardware implementations depending on the design goal. This paper's main goal is for the construction to be efficiently implemented on devices having an AES coprocessor and to offer almost 128 bits of provable security. We use a software-friendly multiplication by $2^8$ as λ and π, which can be implemented efficiently, particularly for byte-oriented ciphers like AES. The main computational cost for each message block is a single execution of AES, thus it can be implemented fast with AES coprocessors.
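To illustrate why a multiplication by $2^8$ is byte-friendly, the following minimal sketch compares eight repeated doublings in GF($2^{128}$) with a single byte-wise shift-and-fold. It is only an illustration under assumptions not taken from this paper: the reduction polynomial $x^{128} + x^7 + x^2 + x + 1$ (the one commonly used for powering-up constructions), the integer encoding of field elements, and the function names `mul2` and `mul_2_8` are all hypothetical choices.

```python
# Toy model of multiplication by 2 and by 2^8 in GF(2^128).
# ASSUMPTION: reduction polynomial x^128 + x^7 + x^2 + x + 1; the paper's
# actual field representation may differ.
N = 128
POLY = 0x87              # low part of the assumed reduction polynomial
MASK = (1 << N) - 1


def mul2(x: int) -> int:
    """Multiply a field element (encoded as an integer) by 2, i.e., by x."""
    x <<= 1
    if x >> N:           # the bit shifted out of the top is folded back in
        x = (x & MASK) ^ POLY
    return x


def mul_2_8(x: int) -> int:
    """Multiply by 2^8 with one byte shift and one fold of the top byte."""
    hi = x >> (N - 8)            # the byte that overflows
    lo = (x << 8) & MASK         # the remaining 15 bytes, moved up by one byte
    # hi * (x^7 + x^2 + x + 1) has degree at most 14, so no further reduction.
    return lo ^ hi ^ (hi << 1) ^ (hi << 2) ^ (hi << 7)
```

Because only whole-byte moves and one small fold are needed, the byte-wise variant avoids the eight bit-level shifts with carries across bytes that repeated doubling would require.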
We evaluate the performance of AES-LBBB in the two different platforms: (i) software implementation on a microcontroller with an AES coprocessor [Mic20] and (ii) hardware implementation for an application-specific integrated circuit (ASIC). We compare the results with the current state of the art (Remus-N2 [IKMP20] instantiated with AES-128) implemented with the same design policy. If we use a block cipher's key schedule function KSF as π, the construction becomes similar to ALE with respect to the data flow. Different from ALE, the feed function λ slightly increases the implementation cost, but it enables us to prove the security of the construction. A limitation is that π needs to satisfy a certain property, which cannot be satisfied by non-linear KSF, including AES's KSF. Hence AES's KSF cannot be used as π.
Here, for an academic purpose, we discuss the design of a KSF that achieves provably secure ALE-like modes or efficient implementations such that the implementation of $\mathrm{KSF}^{-1}$ can be removed. It is known that AES-192 and AES-256 are non-ideal in the related-key setting because of their KSF [BKN09,BK09], and several researchers have proposed alternative KSFs for AES [MHM+02, Nik10, KLPS17]. Among them, Khoo et al. [KLPS17] proposed a KSF that only permutes byte positions, and we show that LBBB can be instantiated efficiently with their KSF. We also show that the KSF of KATAN [CDK09] is very suitable for LBBB, which can achieve a provably secure ALE-like construction.
To avoid misunderstanding, we would like to stress that our primary goal is reducing the memory size, which is beneficial in resource-constrained devices; a BBB-secure parallel mode can be better if the memory size is not a concern. Discussions of design extensions are not limited to lightweight applications, as they consider AES-NI.

Paper Outline
Section 2 introduces preliminary knowledge. Section 3 specifies our mode LBBB for a class of permutations λ and π. Section 4 describes security proofs of our mode. Section 5 specifies the AES-based instance AES-LBBB and explains its implementation results. Section 6 discusses a design extension. Section 7 concludes this paper.
For a non-empty set $\mathcal{T}$, $T \xleftarrow{\$} \mathcal{T}$ means that an element is chosen uniformly at random from $\mathcal{T}$ and assigned to $T$. The concatenation of two bit strings $X$ and $Y$ is written as $X \| Y$ or $XY$ when no confusion is possible. For integers $0 \le i \le n$ and $X \in \{0,1\}^n$, let $\mathsf{msb}_i(X)$ resp. $\mathsf{lsb}_i(X)$ be the most resp. least significant $i$ bits of $X$, and let $|X|$ be the bit length of $X$, i.e., $|X| = n$. For an integer $n \ge 0$ and a bit string $X$, we denote the parsing into fixed-length $n$-bit strings as $(X_1, \ldots, X_\ell)$, where if $X \ne \varepsilon$, then $X = X_1 \| \cdots \| X_\ell$, $|X_i| = n$ for $i \in [\ell - 1]$, and $0 < |X_\ell| \le n$; if $X = \varepsilon$, then $\ell = 1$ and $X_1 = \varepsilon$. For an integer $n > 0$, let $\mathsf{ozp}_n : \{0,1\}^{\le n} \to \{0,1\}^n$ be a one-zero padding function: for $X \in \{0,1\}^{\le n}$, $\mathsf{ozp}_n(X) = X$ if $|X| = n$; $\mathsf{ozp}_n(X) = X \| 1 0^{n-1-|X|}$ if $|X| < n$.
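As a small illustration of the parsing and one-zero padding notation, the sketch below models bit strings as Python strings of the characters `0` and `1`. The function names `parse_blocks` and `ozp` are hypothetical helpers introduced here for illustration only.

```python
def parse_blocks(x: str, n: int) -> list[str]:
    """Split bit string x into n-bit blocks; the last block may be shorter.

    The empty string parses into a single empty block, as in the text.
    """
    if x == "":
        return [""]
    return [x[i:i + n] for i in range(0, len(x), n)]


def ozp(x: str, n: int) -> str:
    """One-zero padding: full blocks are unchanged, shorter ones get 1 0...0."""
    if len(x) > n:
        raise ValueError("input longer than n bits")
    if len(x) == n:
        return x
    return x + "1" + "0" * (n - 1 - len(x))
```

For instance, `parse_blocks("10110", 2)` yields the blocks `10`, `11`, and `0`, and applying `ozp` with n = 2 to the last block gives `01`.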

Block Cipher and Block-Cipher-based AE
A block cipher (BC) is a set of permutations indexed by a key. Throughout this paper, the block and key sizes in bits are fixed to n. An encryption of a BC is denoted by $E$ and its inverse by $E^{-1}$.

We follow the security definition given in Namprempre, Rogaway and Shrimpton [NRS14], called nae-security, and prove the security of LBBB in the ideal cipher model. The nae-security of an AE scheme $\Pi$ in the ideal cipher model is the indistinguishability between $(\Pi_K[E], E, E^{-1})$ and $(\$, \bot, E, E^{-1})$, where a key is defined as $K \xleftarrow{\$} \mathcal{K}$, an ideal cipher is defined as $E \xleftarrow{\$} \mathsf{BC}$, $\$$ is a random-bits oracle that has the same interface as $\Pi.\mathsf{Enc}_K[E]$ and for an encryption query $(N, A, M)$ returns a random bit string of length $|\Pi.\mathsf{Enc}_K[E](N, A, M)|$; $\bot$ is an oracle that returns reject for any decryption query. Throughout this paper, we call queries to $\Pi.\mathsf{Enc}_K[E]$/$\$$ "encryption queries," queries to $\Pi.\mathsf{Dec}_K[E]$/$\bot$ "decryption queries," queries to $E$ "forward offline queries," and queries to $E^{-1}$ "inverse offline queries." The nae-security advantage function of an adversary $\mathcal{A}$ that returns a decision bit $b \in \{0, 1\}$ after interacting with $\Pi.\mathsf{Enc}_K[E], \Pi.\mathsf{Dec}_K[E], E, E^{-1}$ in the real world or with $\$, \bot, E, E^{-1}$ in the ideal world is defined as

$$\mathbf{Adv}^{\mathsf{nae}}_{\Pi}(\mathcal{A}) := \Pr\left[\mathcal{A}^{\Pi.\mathsf{Enc}_K[E],\,\Pi.\mathsf{Dec}_K[E],\,E,\,E^{-1}} = 1\right] - \Pr\left[\mathcal{A}^{\$,\,\bot,\,E,\,E^{-1}} = 1\right],$$

where $\mathcal{A}^{\Pi.\mathsf{Enc}_K[E],\,\Pi.\mathsf{Dec}_K[E],\,E,\,E^{-1}}$ resp. $\mathcal{A}^{\$,\,\bot,\,E,\,E^{-1}}$ is an output of $\mathcal{A}$ in the real world resp. the ideal world, and the probabilities are taken over $K$, $E$, $\$$, and $\mathcal{A}$. We demand that $\mathcal{A}$ is nonce-respecting (i.e., the same nonce is not repeated for encryption queries), never asks a trivial query, and never repeats a query. A trivial decryption query $(N, A, C, T)$ is one for which there is a prior encryption query $(N, A, M)$ such that $(C, T) = \Pi.\mathsf{Enc}_K[E](N, A, M)$, and a trivial forward (resp. inverse) offline query $(\hat{K}, \hat{X})$ (resp. $(\hat{K}, \hat{Y})$) is one for which there is a prior inverse (resp. forward) offline query $(\hat{K}, \hat{Y})$ (resp. $(\hat{K}, \hat{X})$) such that $\hat{X} = E^{-1}(\hat{K}, \hat{Y})$ (resp. $\hat{Y} = E(\hat{K}, \hat{X})$).

Multi-Collision
The upper bound of the number of multi-collision elements is used in our security proof. For a positive integer $s \ge 2$, let $\mathcal{S}$ be a set of size $s$. For sampling $S_1, \ldots, S_q \xleftarrow{\$} \mathcal{S}$, we denote by $N_{q,s}$ the maximum number of multi-collision elements, i.e., $N_{q,s} = \max_{S \in \mathcal{S}} |\{i : S_i = S\}|$. Let $\mathsf{mcoll}(q, s)$ be the expected value of $N_{q,s}$. The upper bound of $\mathsf{mcoll}(q, s)$ is given in Chakraborty et al. [CJN20] and the following lemma.
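The quantity $N_{q,s}$ can be made concrete with a short sketch. The code below computes the maximum multiplicity of a sample list and estimates its expectation by simulation; it is only an empirical illustration, not the analytic bound of [CJN20], and the names `max_multiplicity` and `estimate_mcoll` as well as the trial count and seed are hypothetical.

```python
from collections import Counter
import random


def max_multiplicity(samples):
    """N_{q,s}: the largest number of samples that share the same value."""
    return max(Counter(samples).values())


def estimate_mcoll(q, s, trials=200, seed=0):
    """Monte Carlo estimate of mcoll(q, s) = E[N_{q,s}] for uniform samples."""
    rng = random.Random(seed)
    total = 0
    for _ in range(trials):
        total += max_multiplicity([rng.randrange(s) for _ in range(q)])
    return total / trials
```

When q is much smaller than s, the estimate stays close to 1, which is the intuition behind replacing a birthday-type collision count with a small multi-collision factor.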

Design
We design a BC-based AE scheme, called LBBB. LBBB is highly secure, meaning almost n-bit security; is highly efficient, meaning rate-1 (a BC is processed once for each n-bit plaintext block); and has the minimum state size. The minimum state size is 2n bits when using a block cipher of n-bit data block and n-bit key.
As mentioned in Section 1, LBBB is designed by extending the ALE idea so that it achieves almost n-bit security. To achieve this security level for online queries (or data complexity), the size of the updated state must be 2n bits due to the birthday attack on the internal state. As in ALE, the key state of LBBB is updated by an update-permutation π. Additionally, the key state is updated by the block cipher output via a permutation λ, which is the difference from ALE. The additional procedure ensures that a collision of the internal state requires $2^n$ online complexity, because for two data block sequences, after distinct data blocks are processed, the difference is propagated to the whole internal state of 2n bits. In addition to these two permutations, LBBB uses a permutation η for domain separation, which distinguishes whether the last data block is full or not.
The specification of LBBB is given in Algorithm 1 and is also depicted in Fig. 2. LBBB.Enc is the encryption of LBBB, and LBBB.Dec is the decryption. LBBB.Hash processes a secret key K, a nonce N, and AD A. LBBB.Enc.Main processes a plaintext M and generates a ciphertext C and a tag T. LBBB.Dec.Main processes a ciphertext C and checks the integrity. If the input is not forged, it returns the plaintext M. For a permutation such as π and an integer i, $\pi^i(X)$ denotes i applications of the permutation to the input X. In our proof, we assume that π, λ, and η have the following properties. Here, $\ell_{\max}$ denotes the maximum number of BC calls in LBBB.Enc or LBBB.Dec.
• π is linear and has the property that for any $Y \in \{0,1\}^n$ and $i, j \in (\ell_{\max}]$ such that $i \ne j$, the equation $\pi^i(X) \oplus \pi^j(X) = Y$ offers a unique solution for X.
• λ is linear and has the property that for any $Y \in \{0,1\}^n$, the equation $X \oplus \lambda(X) = Y$ offers a unique solution for X.
• η is linear and has the property that for any $Y \in \{0,1\}^n$ and $i, j \in (2]$ such that $i \ne j$, the equation $\eta^i(X) \oplus \eta^j(X) = Y$ offers a unique solution for X.
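The unique-solution properties above can be checked exhaustively on a toy field. The following sketch uses GF($2^8$) with the AES reduction polynomial and takes doubling as the candidate permutation; the 8-bit size, the index range 0..$\ell_{\max}$, and all function names are assumptions made here for illustration and are not the paper's parameters.

```python
def xtime(x: int, n: int = 8, poly: int = 0x11B) -> int:
    """Multiply by 2 in a toy field GF(2^8) (AES polynomial, assumed here)."""
    x <<= 1
    if x >> n:
        x ^= poly
    return x


def iterate(f, i: int, x: int) -> int:
    """Apply the permutation f to x exactly i times."""
    for _ in range(i):
        x = f(x)
    return x


def is_bijection(g, n: int = 8) -> bool:
    """g(X) = Y has a unique solution for every Y iff g is a bijection."""
    return len({g(x) for x in range(1 << n)}) == (1 << n)


def has_pi_property(pi, l_max: int, n: int = 8) -> bool:
    """Check that X -> pi^i(X) xor pi^j(X) is bijective for all i != j."""
    return all(
        is_bijection(lambda x, i=i, j=j: iterate(pi, i, x) ^ iterate(pi, j, x), n)
        for i in range(l_max + 1)
        for j in range(i)
    )


def has_lambda_property(lam, n: int = 8) -> bool:
    """Check that X -> X xor lam(X) is bijective."""
    return is_bijection(lambda x: x ^ lam(x), n)
```

In this toy setting, doubling passes both checks for small $\ell_{\max}$, whereas the identity permutation fails them because the XOR of two iterates is always zero.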
Candidates of these permutations are discussed in Section 5.1. We next explain the reasons the above properties are introduced.
• π: In the security proof, we need to evaluate the collision probability of the key state. Using the property, we can use the randomness of the key K for a collision of key state elements at distinct blocks, meaning that for $i \ne j$ and some Y, the collision is denoted by $\pi^i(K) \oplus \pi^j(K) = Y$.
• λ: The key state is updated as $KS \leftarrow \pi(KS) \oplus \lambda(S) \oplus C_i$, where the ciphertext block $C_i$ itself depends on S. The property of λ ensures that S, which is the output of E, is not canceled out in the key state update, thus the randomness of S can be used to evaluate the collision probability of key state elements.
• η: This property protects against the length-extension attack.

Security Bound
We show that LBBB achieves $(n - \log_2 n)$-bit security. The security bound is given in the following theorem. The proof is given in Section 4.
Theorem 1. Let $\mathcal{A}$ be an adversary making at most $q_p$ offline queries, $q_e$ encryption queries with at most $\sigma_e$ BC calls in total, and $q_d$ decryption queries with at most $\sigma_d$ BC calls in total such that $2q_p + \sigma \le 2^n$. Let $\sigma := \sigma_e + \sigma_d$ be the total number of BC calls by all queries. Then, we have

By using Lemma 1, the numbers of multi-collision elements are upper-bounded by O(n). Thus, the advantage function is upper-bounded by $O((n(q_p + \sigma_d) + \sigma)/2^n)$, and LBBB achieves $(n - \log_2 n)$-bit security.
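To see how the asymptotic bound translates into a bit-security level, the following short derivation may help. The constant $c$ and the symbol $Q := \max(q_p, \sigma)$ are introduced here only for illustration and do not appear in the theorem; the derivation uses $\sigma_d \le \sigma$.

```latex
\begin{align*}
\text{advantage}
  &\le c \cdot \frac{n(q_p + \sigma_d) + \sigma}{2^n}
   \le c \cdot \frac{(2n + 1)\,Q}{2^n}, \\
\text{so the bound stays small as long as}\quad
Q &\ll \frac{2^n}{n} = 2^{\,n - \log_2 n}, \\
\text{and for } n = 128:\quad
2^{\,128 - \log_2 128} &= 2^{\,128 - 7} = 2^{121}.
\end{align*}
```

This matches the 121-bit security level stated for a 128-bit block cipher.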

Proof of Theorem 1

Overview
We give an overview of the security proof. The goal of this proof is to upper-bound the probability of distinguishing between $(\mathrm{LBBB.Enc}_K[E], \mathrm{LBBB.Dec}_K[E], E, E^{-1})$ and $(\$, \bot, E, E^{-1})$. To bound the distinguishing probability, we find structural differences between $(\mathrm{LBBB.Enc}_K[E], \mathrm{LBBB.Dec}_K[E])$ and $(\$, \bot)$, as $(\mathrm{LBBB.Enc}_K[E], \mathrm{LBBB.Dec}_K[E])$ define outputs using E whereas $(\$, \bot)$ are monolithic and do not use E, and we upper-bound the probabilities that the differences occur.
The first difference comes from collisions in inputs to E. The collisions fall into the following two types.
• The first type is of collisions in inputs to E defined by online queries. The inputs are defined by $\mathrm{LBBB.Enc}_K[E]$ and $\mathrm{LBBB.Dec}_K[E]$; as these have structures of iterating E, an input collision propagates to the output. On the other hand, $\$$ and $\bot$ do not have such structures. Thus, the input collision makes a difference between the real and ideal worlds.
• The second type is of collisions between inputs of E defined by online and offline queries. As ($, ⊥) are monolithic, a collision between inputs defined by online queries and defined by offline queries makes a difference between the real and ideal worlds.
Thus, the probability that the difference occurs is introduced in the distinguishing probability. Regarding inputs to E whose plaintext elements are revealed via the corresponding ciphertext blocks, we need to upper-bound the collision probability using only the randomness of the n-bit key elements. However, a naive birthday analysis on the key elements degrades the security level to n/2 bits. To overcome the issue, we use the multi-collision technique on the ciphertext blocks. Regarding other inputs, the plaintext and key elements are not revealed, thus we can use 2n bits of randomness. Then, we will show that the probability that the difference occurs is at most $O(nq_p/2^n + n\sigma_d/2^n + \sigma^2/2^{2n} + q_p\sigma/2^{2n})$.
The second difference comes from a key recovery in the real world, as using a key one can calculate all outputs of LBBB. Using the randomness of the n-bit key, we have the upper bound of the probability for the difference: $O((q_p + \sigma)/2^n)$.
The third difference comes from the difference between the randomness of outputs of $\mathrm{LBBB.Enc}_K[E]$ and of $\$$, where outputs of $\mathrm{LBBB.Enc}_K[E]$ are defined using E whereas outputs of $\$$ are chosen uniformly at random. As an ideal cipher E with a fixed key element becomes an n-bit random permutation, in $\mathrm{LBBB.Enc}_K[E]$ if the same key element appears twice, the ciphertext elements are not truly random values. Thus, the probability that the difference occurs is introduced in the distinguishing probability. We will show that the probability is at most $O(\sigma^2/2^{2n})$.
The last difference comes from the difference between outputs of $\mathrm{LBBB.Dec}_K[E]$ and of $\bot$, as $\mathrm{LBBB.Dec}_K[E]$ returns a plaintext if a forgery succeeds. Thus, the probability that the difference occurs is introduced in the distinguishing probability. We will show that the probability that the difference occurs (a forgery succeeds) is at most $O(q_d/2^{2n})$.
In the following, we give the details of the security proof. Our proof uses the coefficient-H technique [Pat08], where the above differences are defined as bad events.
Online query numbers 1 to q e are assigned to encryption queries, and those q e + 1 to q e + q d are assigned to decryption queries. Hence, for α ∈ [q e ], α-th encryption query is said to be the α-th online query, and for β ∈ [q d ], β-th decryption query is said to be the (q e + β)-th online query.
For $\alpha \in [q_e + q_d]$, values/variables defined at the α-th online query are denoted by using the superscript $(\alpha)$, and the lengths a and m at the α-th online query are denoted by $a_\alpha$ and $m_\alpha$, respectively. Let $\ell_\alpha := a_\alpha + m_\alpha$ be the length of data blocks at the α-th online query. The initial BC call whose input is $(K, N^{(\alpha)})$ is regarded as the 0-th BC call, and an input-output triple of an i-th BC call is denoted by $(K^{(\alpha)}_i, X^{(\alpha)}_i, Y^{(\alpha)}_i)$. For $\alpha \in [q_p]$, the input-output triple at the α-th offline query is denoted by $(\hat{K}^{(\alpha)}, \hat{X}^{(\alpha)}, \hat{Y}^{(\alpha)})$. We consider an array of data blocks with distinguishing identifiers $\delta^{(\alpha)}_i$ for an α-th online query. The identifier indicates, among other cases, whether the block is the last block and a one-zero string is padded.

Algorithm 2: Dummy Internal Values for Initial Blocks

Dummy Internal Values in Ideal World
In the ideal world, after making all queries, Algorithms 2, 3, and 4 are performed. First, Algorithm 2 defines a dummy key K and dummy outputs for the initial blocks.

Adversary's View
In this proof, after making all queries, an adversary is permitted to obtain all (dummy) internal values. Note that the revealed internal values do not reduce the adversary's advantage. The adversary's view is summarized in a transcript τ, in which $rv^{(q_e + \alpha)}$ is the response of the α-th decryption query: reject or a (decrypted) plaintext. Defining the dummy internal values in the ideal world can be seen as a simulator that mimics the internal values in the real world.

Coefficient H Technique
This proof uses the coefficient-H technique [Pat08]. Let $T_R$ be a transcript in the real world obtained by sampling K and E. Let $T_I$ be a transcript in the ideal world obtained by sampling $\$$ and dummy values. We call a transcript τ valid if $\Pr[T_I = \tau] > 0$. Let $\mathcal{T}$ be the set of all valid transcripts. Then, the advantage of the adversary is at most the statistical distance $\mathsf{SD}(T_R, T_I)$, which can be upper-bounded using the following lemma.
Here, $\mathcal{T}$ is partitioned into two sets: good transcripts $\mathcal{T}_{\mathrm{good}}$ and bad transcripts $\mathcal{T}_{\mathrm{bad}}$.
In the following proof, good and bad transcripts are defined.
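The way the technique bounds the statistical distance can be illustrated on a toy example. The sketch below assumes the commonly used form of the coefficient-H lemma: if the ratio of real to ideal probabilities is at least $1 - \varepsilon$ for every good transcript, then the statistical distance is at most the ideal-world probability of a bad transcript plus ε. The symbol ε, the function names, and the representation of distributions as dictionaries are choices made here for illustration.

```python
from fractions import Fraction


def statistical_distance(p_real, p_ideal):
    """Half the L1 distance between two finite distributions (dicts)."""
    keys = set(p_real) | set(p_ideal)
    return sum(abs(p_real.get(k, 0) - p_ideal.get(k, 0)) for k in keys) / 2


def h_coefficient_bound(p_real, p_ideal, bad):
    """Pr[ideal transcript is bad] + eps, with eps the worst good-ratio gap."""
    pr_bad = sum(p for t, p in p_ideal.items() if t in bad)
    # eps is the smallest non-negative value with real/ideal >= 1 - eps
    # for every valid (ideal probability > 0) transcript outside `bad`.
    gaps = [
        1 - Fraction(p_real.get(t, 0)) / p
        for t, p in p_ideal.items()
        if t not in bad and p > 0
    ]
    eps = max([Fraction(0)] + gaps)
    return pr_bad + eps
```

Choosing the bad set well matters: transcripts whose real-world probability is much smaller than their ideal-world probability should be declared bad, so that ε stays small while the bad set itself has small probability in the ideal world.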

Definitions of Good and Bad Transcripts
A set of bad transcripts T bad satisfies one of the following bad events, and a set of good transcripts T good is defined as T good = T \T bad .
In the ideal world, dummy internal values for online queries (except for LBBB.Dec.Main) are defined independently of E. On the other hand, in the real world, internal values and responses of offline queries are defined using E. Thus, bad events are defined so that the difference appears.
$\mathsf{bad}_1$, $\mathsf{bad}_2$, $\mathsf{bad}_3$: In the ideal world, even if a collision occurs between dummy internal values and offline query-response triples, the output for the online query is defined independently of the output for the offline query. The events handle the collisions, and it can be ensured that if these bad events do not occur, an adversary cannot distinguish between the real and ideal worlds using the difference between online and offline queries.
$\mathsf{bad}_4$, $\mathsf{bad}_5$, $\mathsf{bad}_6$: In the ideal world, for online queries, dummy outputs with distinct prefix data blocks are independently defined, even if a dummy input collision occurs. On the other hand, in the real world, if an input collision occurs for online queries, the outputs are the same, since all outputs are defined by E. Hence, the bad events handle the collisions for online queries, and it can be ensured that if the bad events do not occur, an adversary cannot distinguish between the real and ideal worlds using the difference for internal values defined by online queries.
$\mathsf{bad}_5$: $\exists \alpha \in [q_e + q_d], \beta \in [q_e + 1, q_e + q_d]$,
$\mathsf{bad}_7$: The last event handles a forgery. In the ideal world, all responses of decryption queries are reject, whereas in the real world, a plaintext is sometimes returned. It can be ensured that if this bad event does not occur, an adversary cannot distinguish between the real and ideal worlds using a forgery.
As mentioned above, these bad events handle the differences between the real and ideal worlds. Thus, it can be ensured that an adversary cannot distinguish between the real and ideal worlds as long as no bad event occurs, and for any τ ∈ T good , Pr

Evaluating Pr[T I ∈ T bad ]
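Without loss of generality, assume that an adversary aborts if one of the bad events occurs. Thus, for $i \in [7]$, $\mathsf{bad}_i$ occurs only as long as the other bad events have not occurred. Then, we have $\Pr[T_I \in \mathcal{T}_{\mathrm{bad}}] \le \sum_{i \in [7]} \Pr[\mathsf{bad}_i]$. These upper bounds are given in the following sections. Combining these upper bounds gives the bound on $\Pr[T_I \in \mathcal{T}_{\mathrm{bad}}]$.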

Upper Bound of Pr[bad 1 ]
Recall the definition of bad 1 below.

Upper Bound of Pr[bad 2 ]
Recall the definition of bad 2 below.
We upper-bound Pr[bad 2 ] using the following sub-events.
is not revealed before finishing all queries.
is not revealed before finishing all queries.
$\mathsf{bad}_{2,1}$: Unlike the former analyses, we cannot use the randomness of $Y^{(\beta)}_{i-1}$, which is revealed via the corresponding ciphertext block. To overcome this issue, we use the number of multi-collision elements for the collision with $\hat{X}^{(\alpha)}$, which is at most $\mathsf{mcoll}(\sigma_{e,2}, 2^n)$; thus we have $\Pr[\mathsf{bad}_{2,1}] \le q_p \cdot \mathsf{mcoll}(\sigma_{e,2}, 2^n)/2^n$.

Upper Bound of bad 3
Recall the definition of bad 3 below.

Upper Bound of bad 4
Recall the definition of bad 4 below.
is satisfied due to N (α) = N (β) . We thus assume that 1 < i or 1 < j is satisfied.
Regarding the collision X Regarding the collision Y j−1 , and is chosen uniformly at random from at least 2 n − q e elements in {0, 1} n . Using the randomness of Y Summing the upper bound 2/2 n (2 n − q e ) for each α, β, i, j, we have

Upper Bound of bad 5
Recall the definition of bad 5 below.
First, we consider the collisions K We upper-bound the collision probability using the following sub-events.
Note that regarding the collision X j−1 is a dummy internal value in LBBB.Hash and thus is chosen uniformly at random from at least 2 n − σ elements.
a β , which is chosen uniformly at random from at least $2^n - \sigma$ elements and used to define K 0 , which is chosen uniformly at random from at least $2^n - q$ elements and used to define K j ] ≤ $1/2^n$ due to K and the property of π. Thus, for each α, β, i ≠ j, we have Pr[K
Summing these upper bounds, we have

Regarding the collisions K
, the analysis is similar to that of Summing these upper bounds, we have

Upper Bound of bad 6
Recall the definition of bad 6 below.
As K is chosen uniformly at random from {0, 1} n , we have

Upper Bound of bad 7
Recall the definition of bad 7 below. α ∈ [q p ]} be input-output triples that might be used as dummy internal values of LBBB.Dec.Main. In this analysis, we take into account whether dummy internal values of LBBB.Dec.Main are defined in L or not. We thus consider the following sub-events.
$\mathsf{bad}_{7,1}[\alpha]$: As $T^{(\alpha)}$ is defined using a dummy key K
$\mathsf{bad}_{7,2}[\alpha]$: We assume that $i^*$ is the maximum index such that (K is chosen uniformly at random from at least $2^n - q_p - \sigma_d$ elements and $K^{(\alpha)}_{i^*}$ is defined using a dummy key K, we have $\Pr[\mathsf{bad}_{7,2}[\alpha]] \le q_p/(2^n(2^n - q_p - \sigma_d))$.
Definition 1. For two input-output triples (X , K , Y ), (X * , K * , Y * ) ∈ L, the relation between these triples is denoted by have the following relations: for each u ∈ [r], The upper bound of the number of multi-collision sequences is given in the following lemma.

Lemma 3. Let be a positive integer. Then we have
Exp max Using the upper bound, denoted by r * , and the randomness of a dummy key, we have Pr[bad 7,3 [α]] ≤ r * /2 n . ,

Analysis for Good Transcripts
Fix τ ∈ T good . For W ∈ {I, R} and a set V , T W V denotes an event that T W is compatible with the values in V . Let ) : α ∈ [q e + 1, q e + q d ], i ∈ [a α + 1, α + 1] , and Let τ e := τ e,1 ∪ τ e,2 . Then, we have

Evaluating Pr[T W τ e |T W τ 0 ] for W ∈ {I, R}
As all input-output triples in τ e are distinct by ¬bad 4 , we have |τ e | = σ e − q e . Let N e [K ] := |{(K * , X * , Y * ) ∈ τ e : K * = K }| be the number of input-output triples in τ e whose key elements equal K .
Pr[T R τ e |T R τ 0 ] is evaluated. Note that all key elements in τ e are not K. As

Evaluating Pr[T
∈ τ e }| be the number of input-output triples in τ d,1 whose key elements equal K and that are not in τ e . In both real and ideal worlds, is distinct from other outputs in τ e ∪ τ d,1 whose key elements equal K and

Evaluating Pr[T
∈ τ e }| be the number of input-output triples in τ p ∪ τ d,2 whose key elements equal K and that are not in τ e . In the ideal world, we have In the real world, we have .
Thus, we have

Pr[TI =τ ]
By the above results, we have

Proof of Lemma 3
, i) in L satisfies at least one of the following conditions.
• Cond 2-i where i ∈ [1, − 1]: (K u,i , X u,i , Y u,i ) is defined by a forward offline query or at an encryption query, and (K u,i+1 , X u,i+1 , Y u,i+1 ) is defined by an inverse offline query.
• Cond 3: (K u, , X u, , Y u, ) is defined by a forward offline query or at an encryption query.
See also Fig. 4. We upper-bound the number of multi-collision sequences for each condition.
Cond 1. The first blocks (K u,1 , X u,1 , Y u,1 ) are defined by inverse offline queries and the X u,1 values are the same. As each X u,1 is chosen uniformly at random from at least 2 n − q p elements in {0, 1} n , the number of multi-collision sequences with Cond 1 is at most mcoll(q p , 2 n − q p ).
Cond 3. The last blocks (K u, , X u, , Y u, ) are defined by forward offline queries or at an encryption query, and the XOR values π(K u, ) ⊕ λ(Y u, ) are the same. As each Y u, is chosen uniformly at random from at least 2 n − q p elements in {0, 1} n , the number of multi-collision sequences with Cond 3 is at most mcoll(q p + q e , 2 n − q p ).
Cond 2-i. The number of multi-collision sequences with this condition is upperbounded by v: the number of pairs of query-response triple ( is defined by a forward offline query or at an encryption query, and (K j , X j , Y j ) is defined by an inverse offline query. Fix D, s, j and consider the following cases.
• (K j , X j , Y j ) is defined after (K j , X j , Y j ) is defined. For each (K j , X j , Y j ), the number of inverse offline queries such that the XOR values K j ⊕ λ(X j ) are the same is at most mcoll(q p , 2 n − q p ). Thus, the probability that the output Y j equals one of the values X j is at most mcoll(q p , 2 n − q p )/(2 n − q p ).
Similarly to the former case, for each (K j , X j , Y j ), the number of multi-collisions for XOR values defined by forward offline queries or at encryption queries is at most mcoll(q p + σ e , 2 n − q p ). Thus, the probability that the output X j equals one of the Y j values is at most mcoll(q p + σ e , 2 n − q p )/(2 n − q p ).
Thus, the number of multi-collision sequences with Cond 2-i is at most (q p +σ e )·mcoll(q p + σ e , 2 n − q p )/(2 n − q p ). Finally, using the upper bounds of the number of offline sequences for each case, we obtain the upper bound in Lemma 3.

AES-LBBB: Design and Performance Evaluation
We describe the instantiation of LBBB with AES-128, namely AES-LBBB, followed by its performance evaluation under two different platforms: (i) software implementation on a microcontroller with an AES accelerator [Mic20] and (ii) hardware implementation for ASIC using the NanGate 45-nm standard cell library [Nan].

Specification of AES-LBBB
To achieve the BBB security using AES-128, we design the concrete instantiation of AES-LBBB. This section discusses the considerations in choosing the π, λ, and η functions for AES-LBBB.

The π and λ functions
π should satisfy the following property:
• π is linear and has the property that for any $Z \in \{0, 1\}^n$ and $i, j \in (\ell_{\max}]$ such that $i \neq j$, the equation $\pi^i(S) \oplus \pi^j(S) = Z$ has a unique solution for S.
Notably, the period of π should be at least $\ell_{\max}$, the maximum number of blockcipher calls in LBBB. λ should satisfy the same property except for the period:
• λ is linear and has the property that for any $Z \in \{0, 1\}^n$, the equation $S \oplus \lambda(S) = Z$ has a unique solution for S.
Many symmetric-key algorithms, including PMAC1 and its variants, use functions with the same properties as π. Multiplication over GF($2^n$), e.g., $\pi(S) = S \times 2$, has been used in those conventional works because of its long period ($2^n - 1$) and efficient hardware implementation. Another advantage of choosing such a linear function is that we can unify π and λ by choosing π = λ, for the reason given in Equation (3).
With the above considerations, we choose $\pi = \lambda = \times 2^8$ for AES-LBBB. We choose $\times 2^8$ instead of $\times 2$ because the bytewise operation is more software friendly. Changing the multiplier shortens the maximum number of message blocks to $2^{120} - 1$ but has no practical impact. More specifically, we use the finite field determined by AES-GCM's irreducible polynomial $x^{128} + x^7 + x^2 + x + 1$ [Nat07] for further backward compatibility.
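As an illustration, the following Python sketch models multiplication by $2^8$ in GF($2^{128}$) modulo $x^{128} + x^7 + x^2 + x + 1$. It is a minimal model under stated assumptions, not the reference implementation: we assume that bit i of the integer holds the coefficient of $x^i$ (the byte and bit order used by AES-LBBB may differ), and the helper names mul_x, mul_x8, and pi_pow are ours.

```python
# Sketch of pi = lambda = "multiply by 2^8" in GF(2^128) modulo
# x^128 + x^7 + x^2 + x + 1.
# Assumption: bit i of the Python integer is the coefficient of x^i.

MASK = (1 << 128) - 1
POLY_LOW = 0x87  # x^7 + x^2 + x + 1, i.e. x^128 reduced modulo the polynomial


def mul_x(s: int) -> int:
    """Multiply s by x (the field element "2")."""
    carry = s >> 127
    return ((s << 1) & MASK) ^ (POLY_LOW if carry else 0)


def mul_x8(s: int) -> int:
    """Multiply s by x^8 bytewise: shift one byte, then fold the top byte back."""
    top = s >> 120
    fold = 0
    for bit in range(8):  # carry-less product top * (x^7 + x^2 + x + 1)
        if (top >> bit) & 1:
            fold ^= POLY_LOW << bit
    return ((s << 8) & MASK) ^ fold


def pi_pow(s: int, i: int) -> int:
    """Apply pi (multiplication by 2^8) i times."""
    for _ in range(i):
        s = mul_x8(s)
    return s
```

The bytewise form needs only a one-byte shift and a small correction derived from the top byte, which is one way to see why $\times 2^8$ is more software friendly than $\times 2$.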
Other promising choices are bytewise LFSRs, such as the one given in [Sar09]. The LFSR in [Sar09] uses a tower field representation with the following irreducible polynomials: the irreducible polynomial of the sub-field GF($2^8$) is $\alpha^8 + \alpha^7 + \alpha^3 + \alpha^2 + 1$, and the irreducible polynomial of the field GF($2^{128}$) over the sub-field is $x^{16} + x^7 + x + \alpha$.

The η function
η should satisfy the same property as π except for the period:
• η is linear and has the property that for any $Z \in \{0, 1\}^n$ and $i, j \in (2]$ such that $i \neq j$, the equation $\eta^i(S) \oplus \eta^j(S) = Z$ has a unique solution for S.
Unlike λ, the benefit of choosing the field multiplication is small for η, and it is more meaningful to choose a more efficient function exploiting the shorter period. For AES-LBBB, we choose the following function for η: which is efficient in both software and hardware implementations. As with π and λ, bytewise LFSRs such as the one in [Sar09] are other promising choices.

Target for comparison
To make a consistent performance comparison, we instantiate the state-of-the-art authenticated encryption with associated data (AEAD) with AES-128 and implement it with the same design policy as AES-LBBB: we set Remus-N2 [IKMP20] as the competitor, which is the blockcipher-based variant with BBB security from the Romulus/Remus family. More specifically, we implement Remus-N2 instantiated with AES-128, namely Remus-N2-AES.
Table 2 compares the memory sizes of AES-LBBB and Remus-N2-AES in bits. AES-LBBB's main advantage is its smaller memory footprint: we can implement AES-LBBB with 256 bits of memory, which is 128 bits smaller than Remus-N2-AES. With those memory capacities, we need to overwrite and reuse the memory space for the secret key during the operation and to feed the same secret key again for the next operation. To preserve the secret key for the next operations, we need another 128 bits of memory, so AES-LBBB and Remus-N2-AES use 384 and 512 bits in total, respectively. Table 2 also shows the memory size needed with a threshold implementation (TI), which we discuss later in Section 5.6.
which the microcontroller provides in its mask read-only memory (ROM). This line of code implements a single AES-128 call: it encrypts a 16-byte message in src with the secret key in key and writes the 16-byte ciphertext to dst. We can set the same address for src and dst to overwrite the message with the ciphertext in place.

Interface
The implementations comply with SUPERCOP's interface for AEAD [lab20] and provide the crypto_aead_encrypt and crypto_aead_decrypt functions. The depth of the function calls is important for ROM and RAM sizes because nested functions reduce the code size at the cost of increased stack usage. For a rigorous optimization, we limit the depth from the top-level functions (crypto_aead_encrypt and crypto_aead_decrypt) to one level. We allocate space in global memory (rather than in the stack) for storing the sensitive intermediate values, assuming that there is a special secure region in memory (e.g., TrustRAM in SAM L11 [Mic18]). Meanwhile, we design the sub-function interfaces so that sensitive intermediate data does not stay in the stack: sensitive data is handled only in the last-level functions, which do not use the stack for local variables.

Software Performance Evaluation
Procedure. We write the code in C without assembly-level optimization and compile it using gcc version 6.3.1 on Atmel Studio 7.0. We use the size command to evaluate static memory allocation. Meanwhile, we evaluate the stack usage by inspecting the generated object code between SUPERCOP's interface and the AES function (crya_aes_encrypt).
We measure the execution time by running the implementations on the target chip (the SAM L11 microcontroller) because the simulator with a cycle counter does not support the hardware AES accelerator. We assert a general-purpose input/output (GPIO) pin during the execution and measure the pulse width using an oscilloscope. Then, we calculate the number of cycles by multiplying the time duration by the clock frequency. The target microcontroller runs at 16 MHz, the maximum frequency of the chip's internal oscillator. We evaluate the execution time for a particular test vector composed of 16 bytes of associated data and a 256-byte message; both AES-LBBB and Remus-N2-AES call AES 19 times to process the test vector.
Table 3 shows the software performance of our AES-LBBB and Remus-N2-AES implementations. AES-LBBB and Remus-N2-AES use 32 and 48 bytes of RAM for storing the sensitive intermediate values, as predicted in Table 2. Meanwhile, both implementations use 88 bytes of stack for storing non-sensitive data, e.g., preserved general-purpose registers, function arguments, and loop counters. Reducing the sensitive data's memory size by 16 bytes can be significant because a special memory region for sensitive data is sometimes very limited in size. For example, the SAM L11 microcontroller provides TrustRAM featuring address scrambling and instantaneous wiping [Mic18], which is limited to only 256 bytes. The proposed design achieves a smaller RAM size than AES-GCM, as summarized in Table 1, even considering the SAM L11 microcontroller's AES-GCM acceleration; the microcontroller provides an accelerator for GF($2^{128}$) multiplication, which does not contribute to reducing the RAM size.
Table 3 also shows the speed as the number of cycles needed to finish the entire operation for processing the test vector, including initialization, MAC processing, and encryption. It involves 19 AES calls: 1 for initialization, 1 for processing AD, 16 for processing the message, and 1 for generating the tag. AES-LBBB achieved roughly 32,816 cycles, which is 83% of Remus-N2-AES. Since the number of AES calls is the same for AES-LBBB and Remus-N2-AES, the difference comes from the non-AES operations such as $\times 2^8$, XOR, and the ρ function. A single AES call takes 915 cycles, i.e., 57.2 cycle/byte, and AES occupies 17,385 (= 915 × 19) cycles in total. In other words, non-AES operations occupy 15,431 cycles, or 47.0% of the total execution time, in the AES-LBBB implementation. Less frequent calls to non-AES operations also help to make the code smaller under the limited function depth, and AES-LBBB's ROM size (1,422 bytes) is smaller than that of Remus-N2-AES by 27%.
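The cycle accounting above can be reproduced with a few lines of Python. This is only a bookkeeping sketch using the figures quoted in the text; the function name cycles_from_pulse and the constant names are ours, and the pulse-width conversion assumes an ideal 16 MHz clock.

```python
# Bookkeeping sketch for the AES-LBBB cycle counts quoted in the text.
CLOCK_HZ = 16_000_000        # internal oscillator frequency
TOTAL_CYCLES = 32_816        # AES-LBBB, whole test vector
AES_CALL_CYCLES = 915        # one AES-128 call on the coprocessor
AES_CALLS = 1 + 1 + 16 + 1   # initialization, AD, message blocks, tag
BLOCK_BYTES = 16


def cycles_from_pulse(seconds: float, clock_hz: int = CLOCK_HZ) -> int:
    """Convert a measured GPIO pulse width into a cycle count."""
    return round(seconds * clock_hz)


aes_cycles = AES_CALL_CYCLES * AES_CALLS
non_aes_cycles = TOTAL_CYCLES - aes_cycles
non_aes_share = non_aes_cycles / TOTAL_CYCLES
aes_cycles_per_byte = AES_CALL_CYCLES / BLOCK_BYTES

print(f"AES: {aes_cycles} cycles, non-AES: {non_aes_cycles} cycles "
      f"({non_aes_share:.1%}), AES speed: {aes_cycles_per_byte:.1f} cycle/byte")
```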

Comparison with software implementation of lightweight primitives
We set out to discuss the benefit of using an AES coprocessor compared to optimized software implementation of the state-of-the-art lightweight block ciphers. First, we discuss the memory size, which is the AES-LBBB's primary goal; using a coprocessor provides substantial advantages in memory size over a software implementation of newer lightweight primitives. The ROM needed for using a coprocessor is just a simple interface, which is smaller than a complete block cipher implementation. The coprocessor approach also achieves a smaller RAM because there is no need for intermediate variables and nested function calls.
Second, we discuss the speed. Speed is not AES-LBBB's primary goal, and other modes or primitives can be faster, especially when plenty of memory is available. The performance of AES coprocessors depends on the particular chip, but chip vendors typically design them for average use cases, and rigorously optimized software implementations can outperform them. Although the speed comes at the cost of memory, the fixslicing Skinny-128-128 implementation achieves 58 cycle/byte [AP21], which is almost the same as the 57 cycle/byte that we measured for the SAM L11's AES coprocessor.
Finally, we briefly discuss software performance on high-end processors. AES-LBBB benefits from AES instructions such as AES-NI [Gue10, MNP+20]. Meanwhile, ROM and RAM are cheaper in those processors, and trading memory size for speed using a parallel mode (e.g., OCB) could be better for those targets.

Hardware Implementation
Although legacy devices with AES coprocessors enjoy AES-LBBB's backward compatibility, newer devices have the opportunity to use upgraded coprocessors to communicate with those legacy devices more efficiently. We now discuss the design of such upgraded AES coprocessors.

Interface
We design coprocessors that provide a set of commands for processing a block at a time, which is common in the previous works [NS20]. We decompose the target algorithm into the operational units, as shown in Figure 5. We can realize the AES-LBBB's AD processing, encryption, and decryption by combining those commands. We assume an external main controller that feeds message blocks and dispatches the commands in an appropriate sequence.

AES Implementation
Our design follows the byte-serial architecture [MPL+11] commonly used for conventional compact AES implementations. Figure 6 shows the state and key arrays, flip-flops arranged in a 4 × 4 array, that efficiently realize the AES operations in place. We use a particular variant that uses column-oriented serialization [Sug20], which respects AES's native byte order.
The on-the-fly key schedule is common among compact hardware implementations. As a downside, however, the on-the-fly key schedule overwrites the key register in place, and thus we lose the original AES key after calling an AES encryption. This is a problem for AES-LBBB and Remus-N2-AES, which use the AES key for processing the next blocks. A straightforward workaround is to add an extra 128-bit register storing the AES secret key, but it increases the number of registers and undermines the low-memory advantage of our scheme.
Another way is to implement the inverse key schedule that reverts the final-round key to the initial one. Although implementing such an inverse key schedule is simple and efficient for a linear key schedule [NS20,NSS20], this is not the case for AES's non-linear key schedule. We efficiently address the problem by integrating the inverse key schedule into the key array as shown in Figure 6. First, we describe the AES key schedule as Equation 5, wherein $c_0$, $c_1$, $c_2$, and $c_3$ are 32-bit registers storing each column of an AES key state, and $f(\cdot)$ is the non-linear function composed of RotWord, SubWord, and rcon addition. The idea is to revert Equation 5 with the procedure in Equation 6. Since the function f is common to both, we can realize Equation 6 just by adding several XOR gates as shown in Figure 6. The datapath for Equation 6 efficiently fits the key array's horizontal connection, which was unused in the original column-oriented serialization. The key array finishes Equation 6 in 8 cycles (4 cycles for SubWord and another 4 cycles for the XOR between the columns), and the entire inverse key schedule takes 80 cycles. The entire AES operation takes 324 cycles, of which the inverse key schedule occupies about 25%.
The arrays also integrate the circuits for performing the $\times 2^8$ and η operations, as highlighted in red in Figure 6. With this integration, we can perform $\times 2^8$ and η in one cycle; otherwise, we would need 16 cycles and an additional 8-bit register.
Figure 7 shows the datapath architecture of our AES-LBBB implementation. The dashed line indicates the region for AES, composed of the state and key arrays indicated by (C ST ) and (C K ). We use Canright's design [Can05] for efficiently implementing the S-box (C S ).
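To make the idea concrete, the following Python sketch models one round of the standard AES-128 key schedule (FIPS 197) and its inverse at the word level. It is our assumption that Equations 5 and 6 express this pair of procedures; the function names are hypothetical, and the sketch models neither the byte-serial datapath nor its cycle counts.

```python
# Word-level model of the AES-128 key schedule round and its inverse.
# Columns c0..c3 are 32-bit big-endian words; names are illustrative only.

def _gf_mul(a: int, b: int) -> int:
    """Multiply in GF(2^8) modulo x^8 + x^4 + x^3 + x + 1."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        a <<= 1
        if a & 0x100:
            a ^= 0x11B
        b >>= 1
    return r


def _build_sbox() -> list:
    """Build the AES S-box from field inversion plus the affine map."""
    sbox = []
    for x in range(256):
        inv = 0
        if x:
            inv = 1
            for _ in range(254):      # x^254 is the inverse of x
                inv = _gf_mul(inv, x)
        y = inv
        for shift in (1, 2, 3, 4):    # XOR of four left rotations
            y ^= ((inv << shift) | (inv >> (8 - shift))) & 0xFF
        sbox.append(y ^ 0x63)
    return sbox


SBOX = _build_sbox()
RCON = [0x01, 0x02, 0x04, 0x08, 0x10, 0x20, 0x40, 0x80, 0x1B, 0x36]


def f(word: int, rcon: int) -> int:
    """RotWord, SubWord, then rcon addition."""
    rot = ((word << 8) | (word >> 24)) & 0xFFFFFFFF
    sub = 0
    for shift in (24, 16, 8, 0):
        sub |= SBOX[(rot >> shift) & 0xFF] << shift
    return sub ^ (rcon << 24)


def forward_round(c: list, rcon: int) -> list:
    """One key schedule round: only c0 needs the non-linear f."""
    n0 = c[0] ^ f(c[3], rcon)
    n1 = c[1] ^ n0
    n2 = c[2] ^ n1
    n3 = c[3] ^ n2
    return [n0, n1, n2, n3]


def inverse_round(n: list, rcon: int) -> list:
    """Undo one round: recover c3, c2, c1 by XOR, then c0 with the same f."""
    c3 = n[3] ^ n[2]
    c2 = n[2] ^ n[1]
    c1 = n[1] ^ n[0]
    c0 = n[0] ^ f(c3, rcon)
    return [c0, c1, c2, c3]


def expand_to_last_round_key(key: list) -> list:
    for rcon in RCON:
        key = forward_round(key, rcon)
    return key


def recover_initial_key(last: list) -> list:
    for rcon in reversed(RCON):
        last = inverse_round(last, rcon)
    return last
```

Note that inverse_round evaluates f once on the recovered $c_3$, exactly as forward_round does on the original $c_3$, so both directions can share the same S-box logic.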

Circuit Architecture
We use several XOR and AND gates, in addition to the $\times 2^8$ and η operations integrated into (C ST ) and (C K ), to extend the AES implementation to AES-LBBB. The AND gates regulate the data flow using the control signals from a state machine and realize the operations for each command shown in Figure 5 in a byte-serial manner. The datapath also supports padding of the incoming message, which is indicated by 0x80 in Figure 7. Figure 8 shows the datapath diagram of our Remus-N2-AES implementation, which uses the same components, namely (C ST ), (C K ), and (C S ). To store Remus-N2-AES's larger state, we use a 128-bit shift register with an embedded $\times 2^8$ operation, indicated by (C Ext ).

Hardware Performance Evaluation
Procedure. We describe the designs at the register-transfer level except the scan flip-flop essential for the state and key arrays [MPL+11, NMSS18]: we manually instantiate the
Results. Table 4 shows the post-synthesis circuit area in gate equivalent (GE). Each row corresponds to the implementation of AES-LBBB and Remus-N2-AES. For comparison, the table also shows the performance of the baseline AES implementation composed of the same (C ST ), (C K ), and (C S ). The columns represent the total and component-wise performances.
Our AES-LBBB implementation achieved 3,635 GE, which is smaller than that of Remus-N2-AES by 1,200 GE. This advantage mostly comes from the smaller state size: the 128-bit shift register (C Ext ) occupies 1,023 GE for Remus-N2. This confirms, in line with previous works, that registers dominate the circuit area. The cost of the mode of operation is only 271 GE on top of the 3,364 GE of the baseline AES implementation; we can upgrade an AES coprocessor to support AES-LBBB at a very small cost.
Secret Key Storage. As discussed in Section 5.2, our AES-LBBB and Remus-N2-AES implementations overwrite the AEAD's secret key during the operation, in the same way as some previous implementations [GWDE15,IKMP20]. Meanwhile, other implementations preserve the key so that we can process another message/ciphertext without feeding the same key again [NSS20,NS20]. To compare our results with those implementations, we need to add the cost of an additional 128-bit register. With the approximation of 7 GE/bit, a 128-bit register uses 896 GE, and our AES-LBBB uses 4,531 GE.
Table 4: Post-synthesis circuit area of AES-LBBB, Remus-N2-AES, and the baseline AES implementation shown in gate equivalent (GE).

Threshold implementation.
There is a line of research on optimizing the mode of operation for side-channel attack countermeasures [NS20,NSS20] on resource-constrained devices, which has shown the advantage of low-memory schemes.
Popular countermeasures based on multi-party computation multiply the state size by the number of shares, which virtually multiplies the hardware cost. Table 2 also compares the memory sizes of AES-LBBB and Remus-N2-AES with a 3-share threshold implementation. In this setting, AES-LBBB's advantage becomes even larger because Remus-N2-AES's extra state should be doubled: AES-LBBB is smaller by 256 bits or 1,792 GE (for 7 GE/bit).
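The gate-equivalent figures quoted in this section follow from simple arithmetic, which the sketch below reproduces. It uses only numbers stated in the text together with the 7 GE/bit approximation; the function and variable names are ours.

```python
# Area bookkeeping with the 7 GE/bit register approximation from the text.
GE_PER_BIT = 7
AES_LBBB_GE = 3_635          # post-synthesis area of AES-LBBB
BASELINE_AES_GE = 3_364      # baseline AES implementation
KEY_BITS = 128
TI_ADVANTAGE_BITS = 256      # memory advantage with 3-share TI


def register_area_ge(bits: int, ge_per_bit: int = GE_PER_BIT) -> int:
    """Approximate area of a register with the given number of bits."""
    return bits * ge_per_bit


mode_overhead_ge = AES_LBBB_GE - BASELINE_AES_GE
key_register_ge = register_area_ge(KEY_BITS)
aes_lbbb_with_key_ge = AES_LBBB_GE + key_register_ge
ti_advantage_ge = register_area_ge(TI_ADVANTAGE_BITS)

print(mode_overhead_ge, key_register_ge, aes_lbbb_with_key_ge, ti_advantage_ge)
```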

Design Extension
Recall the design philosophy of ALE: the sequential execution of KSF is efficient in hardware implementations, and the use of the AES round function enables AES-NI to be used, which makes it efficient in software implementations as well. The sequential execution of KSF can be explained further as follows. To use the same key in every block-cipher call while keeping the size of the key state minimal, after computing KSF(K) in some block, we need to apply $\mathrm{KSF}^{-1}$ to the value of KSF(K) to reproduce K for the next block, which incurs an additional cost for $\mathrm{KSF}^{-1}$. In Sect. 5, we optimized our AES-LBBB implementation by integrating $\mathrm{KSF}^{-1}$ into the key array (see Figure 6), but the cost is still non-negligible.
In ALE, the key input to the next block is exactly the output of the key schedule function in the current block, thus no additional circuit is required to bypass multiple rounds. We believe that this idea of ALE deserves further investigation, and we call the scheme that does not require the implementation of KSF −1 in this way an "ALE-like mode." Note that the attractive feature of ALE was obtained by giving up security proofs.
This section discusses suitable primitives for LBBB so that the cost of $\mathrm{KSF}^{-1}$ can be reduced. Note that the goal of this section is to make academic progress toward an ALE-like mode with provable security, and it is independent of the goal in the previous sections of providing backward compatibility with AES coprocessors. Because we no longer focus on AES, we do not assume hardware support for cryptographic operations such as AES-NI. Hence, we aim to propose a scheme that is optimized for hardware implementations.

ALE-like Mode for a Block Cipher with Suitable KSF.
Suppose that there is a dedicated block cipher in which KSF satisfies the property required for π. Then, LBBB becomes an ALE-like mode by using KSF as π. The construction achieves the same efficiency as ALE for the key processing part. The construction needs to implement λ, which is an overhead from ALE, but it enables the construction to achieve almost full-bit security in a provable way.
A lightweight block cipher KATAN [CDK09] is an example of existing designs in which the KSF satisfies the property required for π. KATAN takes an 80-bit key as input. KATAN's KSF consists of a linear feedback shift register (LFSR). Let K i be the i-th bit of the user-provided 80-bit key K and let k i be the i-th bit of the expanded key. K is first set to the 80-bit key state and generates the expanded key bits as follows.
Specifically, one clock of KSF is equivalent to the multiplication by 2 with an irreducible polynomial of $x^{80} + x^{61} + x^{50} + x^{13} + 1$. Note that the designers of KATAN chose this irreducible polynomial to have the minimal Hamming weight of 5 (there are no primitive polynomials of degree 80 with only 3 monomials). Hence the design is very suitable for LBBB, not only for the property required for π but also for its design policy.
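As a rough illustration of why an LFSR-based KSF is cheap in both directions, the following Python sketch clocks an 80-bit LFSR forward and backward. The recurrence $k_i = k_{i-80} \oplus k_{i-61} \oplus k_{i-50} \oplus k_{i-13}$ is written from our recollection of the KATAN specification [CDK09] and should be treated as an assumption to be checked against that specification; the function names are ours.

```python
# Bit-level sketch of an 80-bit LFSR key schedule with assumed KATAN taps.

def expanded_key_bits(key_bits: list, count: int) -> list:
    """Return k_0 .. k_{count-1}, where k_i = K_i for i < 80."""
    if len(key_bits) != 80:
        raise ValueError("the key must have exactly 80 bits")
    k = list(key_bits)
    for i in range(80, count):
        k.append(k[i - 80] ^ k[i - 61] ^ k[i - 50] ^ k[i - 13])
    return k[:count]


def clock(state: list) -> list:
    """Advance the 80-bit window [k_i .. k_{i+79}] by one position."""
    new_bit = state[0] ^ state[19] ^ state[30] ^ state[67]
    return state[1:] + [new_bit]


def unclock(state: list) -> list:
    """Step the window back by one position (the inverse of clock)."""
    old_bit = state[79] ^ state[18] ^ state[29] ^ state[66]
    return [old_bit] + state[:79]
```

Both directions cost four XORs on single bits per clock, which is the kind of linear, easily invertible update that the property required for π calls for.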

Block Cipher with Light Key Schedule.
Suppose that there is a block cipher in which KSF and its inverse are very light or essentially negligible, but the property of π cannot be satisfied. A KSF that only applies a permutation of the key bits is an example. Such a block cipher can be used in LBBB efficiently. We use a multiplication by $2^8$ for π. After computing each block, we need to compute $\mathrm{KSF}^{-1}$ to reproduce K without using an extra state. The assumption here is that the cost of the inverse KSF is very light or negligible. Thus, the only cost is the multiplication by $2^8$, which is a reasonable trade for obtaining provable security.
Note that there exist several designs that can be used in LBBB for achieving almost 128-bit security. GIFT-128 [BPP+17] is an example, in which the block size is 128 bits, the key size is 128 bits, and the KSF is only a bit permutation of the key bits.

AES Variant for AES-NI.
Suppose that we still aim to exploit AES-NI to efficiently compute the round function. Unfortunately, AES's KSF does not satisfy the property for π. However, if only AES's KSF is replaced, AES-NI can still be used for computing the round function. There are several studies that investigate a new KSF for AES [MHM+02, Nik10, KLPS17]. Among them, Khoo et al. [KLPS17] proposed a KSF that only permutes byte positions. This is exactly the case of the "block cipher with light key schedule" explained in the previous paragraph. Hence, AES with the KSF of Khoo et al. [KLPS17] is efficient for this construction in both hardware and software implementations.

Conclusion
In this paper, we proposed a new mode, LBBB, and its AES instance, AES-LBBB, which provides backward compatibility with AES coprocessors as well as high security and low memory. The core idea of LBBB is to introduce a feed from the block cipher's output to the key state via a permutation λ. This enabled us to prove BBB security of LBBB, particularly 121-bit security for AES-LBBB. λ is a software-friendly multiplication by $2^8$, and the main computational cost for each message block is a single execution of AES; thus, AES-LBBB can be implemented efficiently with AES coprocessors. The state size is a minimum of 2n bits; thus, AES-LBBB is also low memory in ASIC implementation. We implemented AES-LBBB to evaluate its performance for (i) software implementation on a microcontroller with an AES coprocessor and (ii) hardware implementation for ASIC. The results showed that AES-LBBB outperforms the current state-of-the-art Remus-N2 instantiated with AES-128 and implemented with the same policy. We also discussed several choices of primitives suitable for LBBB, particularly to point out that a block cipher with an LFSR-based KSF is very suitable for LBBB.
We conclude this paper by mentioning several future directions. The first possible direction is to extend the security proof to a more general class of the key updating function π. Ideally, π should be any function, including a non-linear update, so that π can be any KSF of any cipher, which would remove the necessity of implementing $\mathrm{KSF}^{-1}$ in LBBB. The second direction is to extend the framework so that a key larger than the block size can be covered. This would enable us to support AES-192 and AES-256 in the framework. The last direction we want to mention is the challenge in the standard model. As long as the state size is 2n bits, which is minimal to ensure n-bit security, the ideal-cipher model is required. From a different viewpoint, how small the state can be in the standard model is an interesting research direction.