Areion: Highly-Efficient Permutations and Its Applications to Hash Functions for Short Input

In real-world applications, the overwhelming majority of cases require hashing of relatively short inputs, say up to 2K bytes. The length of almost all TCP/IP packets is between 40 and 1.5K bytes, and the maximum packet lengths of major protocols, e.g., Zigbee, Bluetooth low energy, and Controller Area Network (CAN), are less than 128 bytes. However, existing schemes are not well optimized for short input. To bridge the gap between (future) real-world needs and the limited performance of state-of-the-art hash functions for short input, we design a family of wide-block permutations, Areion, that fully leverages the power of AES instructions, which are widely deployed in many devices. As its applications, we propose several hash functions. Areion significantly outperforms existing schemes for short input and is even competitive for relatively long messages. Indeed, our hash function is surprisingly fast, and its performance is less than 3 cycles/byte on the latest Intel architecture for any message size. In particular, it is about 10 times faster than existing state-of-the-art schemes for short messages up to around 100 bytes, which is the most widely used input size in real-world applications, on both the latest CPU architectures (Ice Lake, Tiger Lake, and Alder Lake) and mobile platforms (Pixel 6 and iPhone 13).


Background
In real-world communication environments, the overwhelming majority of cases require hashing of relatively short inputs, say up to 2K bytes. It is common knowledge that "real-world" TCP/IP packet length is biased towards short packets [MKZ+17], as reflected in the standard benchmark method (Internet Mix and its variants) for Internet routers etc. Packet sizes on the Internet generally follow a bimodal distribution, where 44% of packets are between 40 and 100 bytes long, and 37% are between 1400 and 1500 bytes long. Low-power wireless protocols employ short packets, e.g., the maximum packet length is 127 bytes for Zigbee and 47 bytes for Bluetooth low energy. The next Controller Area Network (CAN) standard, CAN-FD, has a maximum packet size of 64

Paper Organization
In Sect. 2, we describe the specification of Areion. Sect. 3 explains the details of our design rationale for Areion and discusses the optimality of our design choices. In Sect. 4, we show several applications of Areion. In Sects. 5 and 6, we give the security and performance evaluations of Areion and its applications, respectively. Sect. 7 concludes the paper.

Specification of Permutations
We show the specification of Areion. Areion is based on Simpira v2 but has the structure that allows more AES instructions to be executed in parallel. We provide the following two variants of our permutation: Areion-256 and Areion-512. The former accepts a 256-bit block, and the latter accepts a 512-bit block as input.
To illustrate the specification of each permutation, we denote by F_i (i ∈ {0, 1, 2, 3}) the function based on the operations in the AES round function. Let SubBytes, ShiftRows, MixColumns, and AddRoundConstant in the AES round function be SB, SR, MC, and AC, respectively. AC is equivalent to AddRoundKey in ordinary AES, except that a constant is added instead of the round key. F_i consists of a combination of SB, SR, MC, and AC. For each value of i, F_i is defined as follows:
F_0 = MC ∘ SR ∘ SB,
F_1 = SR ∘ SB,
F_2 = MC ∘ SR ∘ SB ∘ AC ∘ MC ∘ SR ∘ SB,
F_3 = MC ∘ SR ∘ SB ∘ AC ∘ SR ∘ SB.
A combination of AES instructions in AES-NI or NEON can implement these functions. Areion-256 consists of F_1 and F_2, and Areion-512 consists of F_0, F_1, and F_3. The round function of each variant is shown in Fig. 1.
We set the numbers of rounds of Areion-256 and Areion-512 to 10 and 15, respectively. These are derived from our security evaluation; Sect. 5 describes the details. The round constants are derived from the binary digits of the fractional part of π = 3.1415926···. Table 1 shows the round constants in hexadecimal notation. In the r-th round of Areion, RC_r is added to the state.

AES Instructions and SIMD
SIMD is an abbreviation for single instruction, multiple data, and is a type of parallel processing. Most modern processors support SIMD instruction sets. SIMD instructions perform operations vector-wise using data stored in dedicated registers, which allows parallel arithmetic/bitwise operations and advanced operations like data shuffling to be performed with a single instruction.
An example of SIMD instructions that can perform complex operations is the set of instructions for executing AES, which is the dominant block cipher. On Intel/AMD processors, these instructions belong to AES-NI (the AES New Instructions set). AES-NI includes aesenc to perform the round function of the encryption, aesenclast for the final round, instructions for decryption, and instructions to support the round key generation. On ARMv8 processors, on the other hand, AES instructions are included in the NEON instruction set. AES instructions in NEON include vaeseq for AddRoundKey, SubBytes, and ShiftRows, and vaesmcq for MixColumns. NEON also supports decryption instructions, while instructions to assist in the round key generation are not supported.
The performance of SIMD instructions can be measured by their latency, throughput, and port usage. Latency means the number of clock cycles that are required for the execution of an instruction. Throughput means the number of clock cycles required to wait before the responsible ports can accept the same instruction again. Dispatched instructions are decomposed into micro-operations and then processed by each execution port.
According to the website by Abel and Reineke et al. [RTL21], the latency and throughput of aesenc/aesenclast on Ice Lake are 3 and 0.5, respectively. A throughput of 0.5 means that two execution ports can accept the micro-operation from aesenc/aesenclast and each port's throughput is 1 [RTL21]. Fig. 2 illustrates the pipelined execution of multiple aesenc on Ice Lake. We can see that up to 6 aesenc can be executed in 5 cycles on Ice Lake using two execution ports, port 0 and port 1.
[Table 2: Latency and throughput of aesenc by processor architecture (Kaby Lake, Coffee Lake, Cannon Lake, Ice Lake, Tiger Lake, Alder Lake, Zen+, Zen2).]
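The pipelining argument above can be checked with a small cycle-count model. The following is a minimal sketch, assuming independent instructions, a fixed latency, and ports that each accept one new instruction per cycle; the function names are hypothetical and the model ignores all other micro-architectural effects.

```python
def pipelined_finish_cycle(num_instr, latency=3, ports=2):
    """Cycle at which the last of `num_instr` independent instructions completes.

    Simplified model: each port accepts one new instruction per cycle,
    and every instruction completes `latency` cycles after it is issued.
    """
    finish = 0
    for i in range(num_instr):
        issue_cycle = i // ports  # `ports` instructions can be issued per cycle
        finish = max(finish, issue_cycle + latency)
    return finish


def sequential_finish_cycle(num_instr, latency=3):
    """Cycle count when every instruction depends on the previous one."""
    return num_instr * latency


if __name__ == "__main__":
    # Six independent aesenc on two ports with latency 3.
    print(pipelined_finish_cycle(6))
    # The same six instructions as a dependency chain.
    print(sequential_finish_cycle(6))
```

Under this model, six independent instructions finish in 5 cycles, whereas a chain of six dependent instructions needs 18, which is why the degree of parallelism inside one round matters for the design.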

Permutations Realized by only AES Instructions
To construct optimal permutations in environments where hardware instructions of AES are available, we focus on a class of permutations that can be implemented solely by AES instructions such as aesenc and aesenclast in AES-NI or vaeseq and vaesmcq in ARMv8 NEON for the following reasons.
• The latency of AES instructions in AES-NI becomes smaller as the processor architecture is upgraded. Moreover, Intel 9th generation and later processors have an additional execution port that accepts micro-operations generated from AES instructions, which improves the throughput from 1 to 0.5. The latency and throughput of aesenc on Intel processors from the 6th to the 11th generation are shown in Table 2.
• Schemes based solely on AES instructions are beneficial in terms of performance and security. Since NIST selected AES as a standard block cipher in 2001, no practical attack on the full cipher has been published in spite of considerable cryptanalytic efforts over the past 20 years, and its security is deeply understood in the symmetric cryptography community. Thus, the security of such schemes can be evaluated convincingly with existing tools and accumulated cryptanalytic knowledge.
• Haraka v2 [KLMR16] is a family of permutations. It is an SPN-type scheme based on AES instructions and word shuffle operations, such as unpack instructions. Shiba et al. show that the structure of Haraka v2 is optimal among SPN-type schemes based solely on AES instructions and shuffle operations [SSI21]. Thus, presenting a new SPN-type scheme with better performance than Haraka v2 should be challenging.
• Word shuffle operations provide only simple linear transformations. In contrast, a single AES instruction call provides not only more complex linear operations (i.e., MixColumns and ShiftRows) but also nonlinear operations (i.e., 16 parallel executions of an 8-bit S-box). In addition, word shuffle operations still have a latency of one cycle, even on the latest CPU architectures. Thus, AES instructions are arguably the most efficient and cryptographically strongest operations among all SIMD instructions.
• Haraka v2 does not provide a sufficient level of security as a hash function according to the recent study by Bao et al. [BDG+21]. They present preimage attacks on Haraka-256 and Haraka-512 up to 9 out of 10 rounds and 11 out of 10 rounds, respectively. In addition, the designers of Haraka v2 did not claim any security as a public permutation. Given these facts, Haraka v2 would require roughly 1.2 to 1.5 times the number of rounds recommended by the designers, i.e., about 12 to 15 rounds, to ensure its security as a public permutation and as a hash function. These additional rounds degrade the performance of Haraka v2 significantly. We remark that, due to the structure of Haraka v2, increasing the number of rounds requires not only more AES instructions but also more word shuffle operations. Thus, the additional rounds impact the overall performance of tweaked versions of Haraka v2 more than they would impact a Feistel-type scheme such as Simpira v2, which belongs to a class of permutations that can be implemented solely by AES instructions.
For the above reasons, we choose the Feistel-type scheme to design new 256- and 512-bit permutations from 128-bit AES instructions.

Feistel-type Scheme for Leveraging the Pipeline
Limitations of Simpira v2. For the 256- and 512-bit variants of Simpira v2 (hereafter referred to as Simpira-256 and Simpira-512, respectively), there is still room for improvement in the design, given the characteristics of AES instructions in modern processors, especially for applications that require sequential executions of the underlying permutation, e.g., SFIL and VIL hash functions. Specifically, the one-block encryption of Simpira-256 requires two executions, and the AES calls must be sequential because the second execution requires the output of the first. On the other hand, the one-block encryption of Simpira-512 is capable of pipelining up to two 2-round AES executions. However, since Intel Ice Lake and later processors can pipeline up to 6 AES instructions, the structures of Simpira-256 and Simpira-512 do not take full advantage of the pipeline.
Pipeline-Friendly Feistel-type Schemes. To take advantage of the pipeline as much as possible, we design pipeline-friendly Feistel-type schemes in which F functions are added to the left branch for the 256-bit version and to the first and third branches for the 512-bit version, as shown in Fig. 3. These allow for pipelined execution of two and four AES instructions, respectively.
As another possible scheme, we can additionally add F functions before the XOR operations in the right branch for the 256-bit version and in the second and fourth branches for the 512-bit version. However, our initial evaluation confirmed that these additional instructions do not improve the performance, because they cannot significantly reduce the number of rounds required to ensure security against structural attacks on Feistel networks, such as impossible differential and integral attacks. Besides, the critical path in the decryption of this scheme becomes three times longer than that of the encryption. From these facts, we conclude that the schemes in Fig. 1 are optimal for 2- and 4-line Feistel-type schemes for high performance.
Comparison. In order to compare the degree of utilization of the pipeline, we measured the instructions per cycle (IPC) of each variant of Areion and Simpira v2 by static code analysis using the LLVM machine code analyzer (llvm-mca). Table 3 shows the results. For both variants, the IPC of Areion is larger than that of Simpira v2. This indicates that the construction of Areion utilizes the pipeline more effectively.

Finding Optimal Constructions
Possible Candidates of F Functions. Recall that our permutations are realized solely by AES instructions. As already discussed in [KLMR16,GM16,SSI21], F functions consisting of one or two AES round functions are optimal in Feistel- and SPN-type schemes. In this work, to find further efficient constructions, we also consider last-round instructions such as aesenclast in AES-NI or vaesmcq in ARMv8 NEON as underlying instructions. Thus, F functions should be realized by one or two combinations of aesenc and aesenclast in AES-NI or of vaeseq and vaesmcq in ARMv8 NEON. There are six possible candidates F_i (i ∈ {0, 1, 2, 3, 4, 5}), where F_0, F_1, F_2, and F_3 are defined in Sect. 2, and F_4 and F_5 are as follows:
F_4 = SR ∘ SB ∘ AC ∘ MC ∘ SR ∘ SB,
F_5 = SR ∘ SB ∘ AC ∘ SR ∘ SB.
For AES-NI, F_0, F_1, F_2, F_3, F_4, and F_5 are implemented by aesenc, aesenclast, aesenc → aesenc, aesenclast → aesenc, aesenc → aesenclast, and aesenclast → aesenclast, respectively. Note that the XOR operations in the Feistel-type scheme are executed by AddRoundKey, which is the last operation of both aesenc and aesenclast. This is why AC is absent at the end of these equations.
For ARMv8 NEON, F_0, F_1, F_2, F_3, F_4, and F_5 are implemented by vaeseq → vaesmcq, vaeseq, vaeseq → vaesmcq → vaeseq → vaesmcq, vaeseq → vaeseq → vaesmcq, vaeseq → vaesmcq → vaeseq, and vaeseq → vaeseq, respectively. As vaeseq performs AddRoundKey before SubBytes, the AddRoundKey operation of the first vaeseq in each function is used to realize the XOR operation of the previous round for Feistel-type schemes. This observation implies that our schemes can be implemented solely by vaeseq and vaesmcq in NEON, except for the XOR operation in the last round.
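The correspondence between the AES-NI and NEON realizations of the six candidates can be checked symbolically. The sketch below is illustrative only: it models each instruction as a list of operation names (AK stands for the AddRoundKey step that acts as AC inside a function and as the Feistel XOR at its boundary), and the helper names are hypothetical.

```python
# Operation sequences of the individual instructions, in execution order.
AESNI_OPS = {"aesenc": ["SB", "SR", "MC", "AK"], "aesenclast": ["SB", "SR", "AK"]}
NEON_OPS = {"vaeseq": ["AK", "SB", "SR"], "vaesmcq": ["MC"]}

# Instruction sequences for F_0 ... F_5 as listed in the text.
F_AESNI = {
    0: ["aesenc"],
    1: ["aesenclast"],
    2: ["aesenc", "aesenc"],
    3: ["aesenclast", "aesenc"],
    4: ["aesenc", "aesenclast"],
    5: ["aesenclast", "aesenclast"],
}
F_NEON = {
    0: ["vaeseq", "vaesmcq"],
    1: ["vaeseq"],
    2: ["vaeseq", "vaesmcq", "vaeseq", "vaesmcq"],
    3: ["vaeseq", "vaeseq", "vaesmcq"],
    4: ["vaeseq", "vaesmcq", "vaeseq"],
    5: ["vaeseq", "vaeseq"],
}


def expand(instructions, table):
    """Flatten a list of instructions into the list of operations they perform."""
    return [op for instr in instructions for op in table[instr]]


def core_ops_aesni(i):
    # The trailing AK of the last AES-NI instruction realizes the Feistel XOR.
    return expand(F_AESNI[i], AESNI_OPS)[:-1]


def core_ops_neon(i):
    # The leading AK of the first vaeseq realizes the XOR of the previous round.
    return expand(F_NEON[i], NEON_OPS)[1:]


if __name__ == "__main__":
    for i in range(6):
        print(i, core_ops_aesni(i), core_ops_aesni(i) == core_ops_neon(i))
```

Once the boundary AddRoundKey is set aside on each side, both instruction sets yield the same operation sequence for every F_i in this model, which matches the observation that only the XOR in the last round falls outside vaeseq and vaesmcq.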
How to Find F functions. To find optimal combinations of functions F_i (i ∈ {0, 1, 2, 3, 4, 5}) in Fig. 3, we first evaluate the security against differential/linear, impossible differential, and integral attacks using Mixed-Integer Linear Programming (MILP) for all combinations. Let R_1, R_2, and R_3 be the numbers of rounds for which the following three conditions are satisfied, respectively.
R_1: The number of rounds where the minimum number of active S-boxes is enough to ensure security against differential/linear attacks.
R_2: The number of rounds where there is no byte-truncated impossible differential characteristic.
R_3: The number of rounds where there is no byte-wise integral distinguisher.
Besides, we define R_max as max{R_1, R_2, R_3}. After obtaining R_max, we examine the performance characteristics at R_max rounds to find the most efficient candidates. The details are explained below.

On 256-bit Permutations
Let a 256-bit permutation with functions F_α and F_β be denoted by (α, β)-perm, where α, β ∈ {0, 1, 2, 3, 4, 5}, as illustrated in Fig. 3. As a 256-bit permutation has two functions, each with six possible candidates, the total number of combinations is 36 (= 6 × 6). Among them, we look for combinations implemented by the lowest number of AES instructions in R_max rounds, i.e., we choose the ones that achieve the required security level with the lowest number of AES instructions. Table 4 shows R_1, R_2, R_3, R_max, and the number of AES instructions in R_max rounds for all 36 candidates. According to this table, the lowest one is (2, 1)-perm, for which R_1, R_2, and R_3 are estimated as 5, 5, and 4, respectively; namely, R_max = 5, and the number of AES instructions in 5 rounds is only 15. From this result, we select (2, 1)-perm as the underlying construction for Areion-256.
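The instruction count used as the selection criterion follows directly from the AES-NI realizations of the candidates. As a minimal sketch (the per-function instruction counts are taken from the AES-NI mapping in Sect. 3; the R_max values of the other candidates come from Table 4 and are not reproduced here):

```python
from itertools import product

# Number of AES-NI instructions needed for each candidate F_i:
# F_0 and F_1 take one instruction, F_2 to F_5 take two.
AES_INSTR = {0: 1, 1: 1, 2: 2, 3: 2, 4: 2, 5: 2}


def instr_count(alpha, beta, rounds):
    """AES instructions used by (alpha, beta)-perm over `rounds` rounds."""
    return rounds * (AES_INSTR[alpha] + AES_INSTR[beta])


# All (alpha, beta) combinations for the 256-bit permutation.
candidates = list(product(range(6), repeat=2))

if __name__ == "__main__":
    print(len(candidates))        # number of candidate combinations
    print(instr_count(2, 1, 5))   # (2, 1)-perm at R_max = 5
```

For (2, 1)-perm this gives 3 instructions per round and 15 instructions in 5 rounds, in agreement with the figure quoted above.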
To find the most efficient combination among them, we thoroughly analyze security and performance using the following procedures.
Step 1: Limiting the Number of AES Instructions in R_max. As with the 256-bit case, we focus on combinations implemented by the lowest number of AES instructions in R_max rounds. As a result of our search, we find 30 candidates for which the number of AES instructions in R_max (= 9) rounds is the lowest, namely 45, as shown in Table 5.
Step 2: Eliminating the Equivalent Candidates. 28 of the 30 remaining candidates can be classified into 14 equivalence classes, i.e., each class contains two of these candidates. Based on this fact, we can eliminate one candidate from each of the 14 classes, which reduces the number of candidates to 16 (= 30 − 14).
Step 3: Considering Efficiency in NEON Instructions. The remaining 16 combinations can be classified into three different classes, each of which has a different π. The two constructions in π_3 are unsuitable for implementations using NEON instructions in ARMv8. This is because implementing these two constructions with NEON requires successive XORs, which prevents an implementation with only vaeseq and vaesmcq while maintaining the compatibility of the implementation on ARM and Intel. Based on this fact, we eliminate the two constructions using π_3 from the candidates. As a result, we obtain 14 constructions.
Step 4: Estimating Theoretical Number of Cycles. For the remaining 14 candidates, we use the performance analysis tool llvm-mca to estimate the theoretical number of cycles on Ice Lake or later architectures. Table 6 shows the theoretical values of the total cycles in 15-round encryption, calculated by llvm-mca. According to this result, we reduce the candidates to the 6 with the lowest number of cycles to perform the encryption.

SFIL Hash Function
For SFIL hashing, we apply Areion to the Davies-Meyer (DM) construction, which consists of a permutation with a feed-forward (an XOR) of the input. The use of DM for SFIL hashing has already been discussed in [GM16,KLMR16]. For example, Haraka v2 is defined as
Haraka-256(x) = π_256(x) ⊕ x,
Haraka-512(x) = trunc(π_512(x) ⊕ x),
where π_256 and π_512 are the 256- and 512-bit permutations of Haraka v2, respectively, and trunc : F_2^512 → F_2^256 is a truncation function defined as follows: trunc(x_0 || ··· || x_15) = x_2 || x_3 || x_6 || x_7 || x_8 || x_9 || x_12 || x_13, where x = x_0 || ··· || x_15 ∈ F_2^512. Our SFIL hash functions, Areion256-DM and Areion512-DM, use Areion-256 and Areion-512 instead of the permutations of Haraka v2. The DM construction uses only the forward direction of the permutation, and the overhead beyond the permutation is negligible. Thus, the performances of Areion256-DM and Areion512-DM are effectively the same as those of the forward direction of the underlying permutations.
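The feed-forward and truncation steps can be sketched as follows. This is an illustration only: `toy_permutation` is a hypothetical stand-in and is neither Areion nor Haraka v2, and the code assumes that the words x_0, ..., x_15 are consecutive 32-bit words of the 64-byte state in input byte order.

```python
def toy_permutation(state: bytes) -> bytes:
    """Hypothetical placeholder bijection on byte strings (NOT Areion or Haraka)."""
    rotated = state[1:] + state[:1]
    return bytes(b ^ (i & 0xFF) for i, b in enumerate(rotated))


def xor_bytes(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))


def trunc(x: bytes) -> bytes:
    """Keep words x_2, x_3, x_6, x_7, x_8, x_9, x_12, x_13 of a 512-bit value."""
    assert len(x) == 64
    words = [x[4 * i:4 * i + 4] for i in range(16)]  # assumed 32-bit words
    return b"".join(words[i] for i in (2, 3, 6, 7, 8, 9, 12, 13))


def dm_256(perm, x: bytes) -> bytes:
    """256-bit DM: permutation output XORed with the input."""
    assert len(x) == 32
    return xor_bytes(perm(x), x)


def dm_512(perm, x: bytes) -> bytes:
    """512-bit DM followed by truncation to 256 bits."""
    assert len(x) == 64
    return trunc(xor_bytes(perm(x), x))


if __name__ == "__main__":
    digest = dm_512(toy_permutation, bytes(range(64)))
    print(len(digest), digest.hex())
```

Because only one forward call of the permutation and one XOR are needed, the cost of such a construction is dominated by the permutation itself.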
The designers of Simpira v2 suggested its application to SFIL hash functions [GM16]. Then, for performance comparison, we define DM construction instantiations of Simpira v2 in the same way as above and refer to them as Simpira256-DM and Simpira512-DM.

VIL Hash Function
For VIL hashing, we apply Areion-512 to the Merkle-Damgård (MD) construction, a classical method of building a cryptographic hash function from a compression function [Mer89,Dam89].
Our VIL hash function, Areion512-MD, is an MD construction instantiated with Areion512-DM. Other design details of Areion512-MD follow SHA2-256 [oST15a]. SHA2-256 has two phases: preprocessing and hash computation. The former is further divided into three steps: padding the message, parsing the message into message blocks, and setting the initial hash value. For Areion512-MD, padding and message parsing are executed by the same procedure as in SHA2-256. However, the length of the padded message is adjusted to be a multiple of 256 bits instead of a multiple of 512 bits, and the size of a parsed message block is 256 bits (see [oST15a,Section 5] for more details). Areion512-MD uses the same initial hash value H as SHA2-256, split into two 128-bit words.
Then, Areion512-DM is used for the hash computation phase. The parsed message block is inserted into the input word positions x_0 and x_1 of Areion-512, and the initial hash value and chaining values (that is, the output value of each compression function) are set into the input word positions x_2 and x_3 of Areion-512 (see Fig. 1b). Finally, the output value of the last DM compression function becomes the 256-bit message digest.
The designers of Simpira v2 and Haraka v2 did not mention their application to VIL hash functions [GM16,KLMR16]. However, for performance comparison, we define MD construction instantiations of Simpira512-DM and Haraka512-DM in the same way as above and refer to them as Simpira512-MD and Haraka512-MD, respectively.
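The preprocessing and chaining described above can be sketched as follows. This is a rough illustration under stated assumptions: the padding is the SHA2-256 rule (a 0x80 byte, zero bytes, then a 64-bit big-endian bit length) adjusted to 256-bit blocks, and `compress` is a hypothetical callable standing in for Areion512-DM that maps a 256-bit message block and a 256-bit chaining value to a new 256-bit chaining value.

```python
BLOCK_BYTES = 32  # 256-bit message blocks


def pad(message: bytes) -> bytes:
    """SHA2-256-style padding, with the total length a multiple of 256 bits."""
    bit_len = 8 * len(message)
    padded = message + b"\x80"
    while (len(padded) + 8) % BLOCK_BYTES != 0:
        padded += b"\x00"
    return padded + bit_len.to_bytes(8, "big")  # assumed 64-bit length field


def parse(padded: bytes):
    """Split the padded message into 256-bit blocks."""
    return [padded[i:i + BLOCK_BYTES] for i in range(0, len(padded), BLOCK_BYTES)]


def md_hash(compress, iv: bytes, message: bytes) -> bytes:
    """Merkle-Damgard iteration: chain `compress` over all message blocks."""
    h = iv
    for block in parse(pad(message)):
        h = compress(block, h)
    return h


if __name__ == "__main__":
    # Toy compression function for demonstration only (not secure).
    toy_compress = lambda block, h: bytes(a ^ b for a, b in zip(block, h))
    print(md_hash(toy_compress, bytes(32), b"short input").hex())
```

With 256-bit blocks, a message of up to 23 bytes fits into a single compression call under this padding rule, which is relevant for the short-input setting targeted here.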

Security for Underlying Permutations
We evaluate the security of Areion-256 and Areion-512 as public permutations against differential, linear, impossible differential, and integral attacks.
Claimed Security for Underlying Permutations. We claim 128-bit security for both Areion-256 and Areion-512, as with Simpira v2, i.e., we consider attacks with complexity up to 2^128. There is no rigorous definition of a distinguisher for a public permutation. In the literature, however, there are closely related concepts: the known-key distinguisher [KR07] and correlation intractability [CGH04]. Note that once the key of a block cipher is known, the block cipher becomes a public permutation. Roughly speaking, a known-key distinguisher for a block cipher is a relation such that, given the key, it is easy to find plaintext-ciphertext pairs satisfying the relation, whereas it is difficult to find such pairs for a random permutation [KR07]. If the relation is simply the description of the block cipher itself, it is meaningless for two reasons. First, every block cipher would be vulnerable to this attack with only 1 query. Second, such a relation is not interesting at all from the designers' perspective [KR07]. In [Gil14], a more formal definition of the known-key distinguisher for a block cipher was given, which makes the above statement rigorous. Both known-key distinguishers on AES [KR07,Gil14] are extensions of the well-known integral attack on round-reduced AES, where the attacker starts from a middle round and aims to find an input-output set such that the sum of some bytes of the inputs and the sum of some bytes of the outputs are both zero. We rely on similar start-from-the-middle techniques to construct zero-sum distinguishers for our proposed public permutations. Our zero-sum distinguishers therefore resemble the known-key distinguishers on AES [KR07,Gil14], since they are likewise based on the well-known integral attack on AES.
Differential/Linear Attacks. We estimate the security against differential/linear attacks by obtaining the lower bound on the number of differentially/linearly active S-boxes with the MILP-based method proposed by Mouha et al. [MWGP11]. AS_D and AS_L denote the lower bounds on the number of differentially and linearly active S-boxes, respectively. Since the maximal differential and linear probabilities of the AES S-box are both 2^{-6}, AS_D and AS_L of at least 22 (2^{-6×22} < 2^{-128}) are sufficient to ensure 128-bit security against differential/linear attacks. Table 7 shows the lower bounds on the number of differentially/linearly active S-boxes for Areion-256 and Areion-512. In our evaluation, Areion-256/Areion-512 achieves both AS_D and AS_L of at least 22 at 4/6 rounds, and both AS_D and AS_L at 12 rounds exceed 22 by a wide margin for both permutations. Therefore, we expect that full rounds of Areion-256 and Areion-512 can resist differential and linear attacks.
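The threshold of 22 active S-boxes follows from a one-line calculation, sketched here with a hypothetical helper name. It returns the smallest n such that the probability bound 2^{-6n} falls strictly below 2^{-128}.

```python
def min_active_sboxes(security_bits: int = 128, sbox_prob_log2: int = -6) -> int:
    """Smallest n with 2^(sbox_prob_log2 * n) < 2^(-security_bits)."""
    per_sbox_bits = -sbox_prob_log2
    # n * per_sbox_bits must strictly exceed security_bits.
    return security_bits // per_sbox_bits + 1


if __name__ == "__main__":
    n = min_active_sboxes()
    print(n, 6 * n)
```

With 21 active S-boxes the bound is only 2^{-126}, so 22 is the first value for which the exponent (132) exceeds 128.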
Impossible Differential Attacks. The miss-in-the-middle approach is known as an efficient way to find the longest impossible differentials, and it can be implemented by an MILP with a small change to the MILP model for counting the number of differentially active S-boxes [ST17,CJF+16]. In our evaluation, to find the longest impossible differentials efficiently, we search the class of impossible differential characteristics where the input and output differences each activate only one byte.
By this approach, we find the impossible differences at 4/8 rounds of Areion-256/Areion-512, both of which are the longest ones we can find. Since there is still enough margin to full rounds for both permutations, we expect that full rounds of Areion-256 and Areion-512 can resist impossible differential attacks.
Integral Attacks. To find integral distinguishers, we evaluate the byte-wise division property with the MILP-based method proposed by Xiang et al. [XZBL16]. We search the input space where only one byte is constant and the remaining bytes are active, i.e., the data/time complexities of the integral distinguishers are 2^248 and 2^504 for Areion-256 and Areion-512, respectively.
As a result, we find 3- and 5-round integral distinguishers on Areion-256 and Areion-512, respectively. It should be emphasized that the data/time complexities required for these distinguishers exceed our security claim. Hence, the longest integral distinguishers with up to 2^128 data/time complexity, which fall within our security claim, are expected to cover fewer rounds than these distinguishers. Thus, we expect that full rounds of Areion-256 and Areion-512 can resist integral attacks.
Zero-sum Distinguishers. The zero-sum distinguisher [AM09] is a popular attack on public permutations. The overall attack procedure is straightforward. Specifically, the attacker first chooses a particular set of intermediate state values and then propagates this set of values backwards and forwards. If, in the corresponding sets of inputs and outputs, the sum of some input bits and the sum of some output bits are both zero, a zero-sum distinguisher is found. We have evaluated the resistance against this attack based on the well-known 4-round integral distinguisher for AES. We find zero-sum distinguishers for 5-round Areion-256 and 10-round Areion-512. The data and time complexities of the two zero-sum distinguishers are the same, namely 2^32. We give the details below.
The distinguisher for 5-round Areion-256. First, we explain the zero-sum distinguisher for 5-round Areion-256, as shown in Fig. 4. Here, x^r_i denotes the i-th word of the state after r rounds. We let 4 bytes of x^2_1 traverse all 2^32 possible values, and we assign a random constant value to x^2_0. According to the round function, x^5_0 contains the terms G_1 ∘ G_0(x^4_0) and x^4_1. With our input form for (x^2_0, x^2_1), the term G_1 ∘ G_0(x^4_0) is equivalent to passing x^2_1 through 4 AES rounds, and the term x^4_1 is equivalent to passing x^2_1 through 2 AES rounds. Hence, we need a data set of size 2^32, and all the bytes in x^5_0 will be balanced. For x^5_1, as x^4_0 can be viewed as applying 2 AES rounds to x^2_1, each byte of x^5_1 will also be balanced. The above observation also explains why the automatic method based on the division property could only detect a 3-round integral distinguisher in the forward direction, i.e., we need to consider at least 4 AES rounds.
In the backward direction, all the bytes in x^0_0 will be balanced. To better understand this, one can first consider the case where only one byte of x^2_1 traverses all 2^8 possible values. In this case, it can be easily checked that each byte in x^0_0 also traverses all 2^8 possible values. Hence, if one diagonal of x^2_1 takes all 2^32 possible values, all bytes in x^0_0 are also balanced.
The distinguisher for 10-round Areion-512. Next, we explain the zero-sum distinguisher for 10-round Areion-512, as shown in Fig. 5. We start from the state (x^4_0, x^4_1, x^4_2, x^4_3) after 4 rounds of the permutation. For the input form, we require that 4 bytes of x^4_0 traverse all 2^32 possible values, as shown in Fig. 5. Then, we randomly choose a 128-bit constant C such that F_0(x^4_0) ⊕ x^4_1 = C always holds. In other words, the value of x^4_1 is conditioned, and it is dynamically chosen according to x^4_0. We assign a random constant value to x^4_2. We also assign a random constant value C' to x^4_3, but we require that the first column of x^3_0 = F_1^{-1}(C') is all 0. Note that (F_0, F_1, F_2, F_3) are defined in Sect. 2. For such an input state, in the forward direction, we can trivially deduce that (x^6_0, x^6_1, x^6_3) are constants and that one diagonal of x^6_2 takes all 2^32 possible values. Therefore, we can also deduce that (x^7_0, x^7_3, x^8_3) are all constants, and x^10_2 can be written as a combination of terms in x^6_2 with 128-bit constants C_i. The first term, in which x^6_2 passes through four nested F functions interleaved with the constants C_3, C_0, and C_1, is equivalent to applying 4 AES rounds to x^6_2. The term F_1(F_3(x^6_2) ⊕ C_2) is equivalent to applying 2.5 AES rounds to x^6_2. This implies that we need a data set of size 2^32 to detect an integral property at x^10_2. Since one diagonal of x^6_2 takes all 2^32 possible values, each byte in x^10_2 is balanced. For (x^10_0, x^10_1, x^10_3), we lose the zero-sum property, which can be deduced similarly. In other words, we obtain a 6-round integral distinguisher with data complexity 2^32 in the forward direction, which is one more round than the result obtained with the automatic method based on the division property. The main reason is that we dynamically choose values for x^4_1 such that F_0(x^4_0) ⊕ x^4_1 is always a constant when x^4_0 varies.
In the backward direction, we consider a subset of (x^4_0, x^4_1, x^4_2, x^4_3). Specifically, we consider the case where the first byte of x^4_0 takes all 2^8 possible values. In this case, the value of the first column of x^4_1 is dynamically chosen such that F_0(x^4_0) ⊕ x^4_1 is a constant C, as shown in Fig. 5. Then, we have 2^24 such subsets in total. Since F_0 = MC ∘ SR ∘ SB, the round function implies that the first byte of AC ∘ SR ∘ SB(x^3_2) traverses all 2^8 possible values. Hence, only the first byte of x^3_2 traverses all 2^8 possible values. Therefore, we obtain the form of (x^3_0, x^3_1, x^3_2, x^3_3) shown in Fig. 5.

Security for Hash Functions
Claimed Security for Hash Functions. We claim 256-bit security against the preimage attack for both Areion256-DM and Areion512-DM. However, as in Haraka v2 and the SFIL hash function built on Simpira v2, we do not claim resistance against the collision attack, since it is unnecessary for their applications.
For the MD-based hash function, we claim 256-bit security against the preimage attack and 128-bit security against the collision attack, the same as SHA2-256. Due to the generic second-preimage attack on the MD construction [KS05], our MD-based hash scheme can only provide about 193-bit security against second-preimage attacks. This limitation arises because the maximal number of allowed message blocks is 2^64 and 193 = 256 − 64 + 1, which is the same security level as SHA2-256.

Meet-in-the-Middle Preimage Attack. For the DM-based SFIL hash functions using Areion-256 and Areion-512 as the underlying permutations, it is necessary to take into account Sasaki's meet-in-the-middle (MITM) preimage attack [Sas11]. This attack is the most powerful preimage attack on such hash functions. Indeed, the designers of Haraka v2 have evaluated its resistance against this attack in a dedicated way. To better understand the security of our constructions, we also performed a careful analysis. We found preimage attacks on 5-round Areion256-DM and 10-round Areion512-DM. Therefore, there is still a sufficiently large security margin. We detail our analysis below.
To save space, we only describe the general procedure of Sasaki's meet-in-the-middle preimage attack, as shown below: Step 1: Identify the bytes that are fixed to constants and assign proper values to them.
Step 2: Identify the bytes that are to be exhausted. Classify them into backward neutral bytes and forward neutral bytes.
Step 3: In the forward direction, we assume that the backward neutral bytes are unknown and compute the internal state values based on the constant bytes and the forward neutral bytes. In other words, we only compute the bytes that can be computed from the knowledge of the constant bytes and the forward neutral bytes. This step is repeated for all the possible values of the forward neutral bytes, and we store the corresponding computed information.
Step 4: In the backward direction, we assume that the forward neutral bytes are unknown, and we only compute the bytes that can be computed from the knowledge of the constant bytes and the backward neutral bytes. This step is repeated for all the possible values of the backward neutral bytes, and we store the corresponding computed information 3 .
Step 5: Find matches between the stored information obtained at Step 3 and Step 4. Suppose the matching probability is $2^{-p}$ and there are $2^{b_f}$ and $2^{b_b}$ possible values for the forward neutral bytes and the backward neutral bytes, respectively. Moreover, if for each piece of state information obtained at Step 3 it is possible to identify the matching information obtained at Step 4 with time complexity 1, or vice versa, we find $2^{b_f+b_b-p}$ possible pairs among the $2^{b_f+b_b}$ pairs with time complexity $\max(2^{b_f}, 2^{b_b})$. In other words, we exhaust $2^{b_f+b_b}$ possible candidates with a time complexity of only $\max(2^{b_f}, 2^{b_b})$. Hence, the MITM preimage attack is $\min(2^{b_b}, 2^{b_f})$ times faster than brute force.
The attack therefore aims to identify the forward and backward neutral bytes as well as an efficient matching method. For the two short-input hash schemes, we performed a careful analysis and found preimage attacks on 5-round Areion256-DM and 10-round Areion512-DM, respectively. In the two attacks, $b_b = b_f = 8$ and the matching phase can be finished efficiently with time complexity 1. Hence, both preimage attacks are $2^8$ times faster than brute force. The two preimage attacks are illustrated in Figs. 6 and 7, respectively.

Collision Attacks. The most powerful collision attack on AES-based hash functions is the rebound attack [MRST09], especially when the hash function is built on the DM construction, as the attacker can fully control the whole internal state. However, as already mentioned for Haraka v2 and the SFIL hash function based on Simpira v2, collision resistance of SFIL hash schemes is not necessary when they are used in a signature scheme, which is also the case for our SFIL hash functions.

Security of MD Construction. For our hash scheme built on the MD construction, the attacker will soon lose the capability to fully control the internal state since each message block is only 256 bits, i.e., half of the state size.
However, by using $j > 1$ message blocks, Sasaki's MITM attack can still be applied in the same way as in the attack on the DM constructions. Specifically, although the 256-bit initial value set at $(x_2, x_3)$ in the first input state is fixed, the attacker can view the 256-bit chaining variable (CV) in the last input state as a controllable part. Then, Sasaki's MITM attack is applied, and we aim to find $2^i$ solutions of the last input state that match the given hash value in less than $2^{256}$ time. This way, $2^i$ candidates of the CV in the last input state can be obtained. Finally, we randomly pick values for the first $j - 1$ message blocks to compute the corresponding CV for the last input state and expect that one such CV matches one of the $2^i$ candidates obtained by the MITM attack. Hence, we need to try $2^{256-i}$ different values for the first $j - 1$ message blocks, and the time complexity is below $2^{256}$.
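The store-and-match idea behind Steps 3 to 5 can be illustrated on a toy target. The sketch below is not the attack on Areion: `toy_hash`, `_partial`, and the 16-bit output are hypothetical stand-ins chosen so that one half depends only on a "forward" byte and the other half only on a "backward" byte. It shows how a lookup table lets $2^8 + 2^8$ partial evaluations cover all $2^{16}$ candidate pairs.

```python
import hashlib


def _partial(label: bytes, value: int) -> int:
    """Hypothetical 16-bit partial result computed from one 8-bit neutral value."""
    digest = hashlib.sha256(label + bytes([value])).digest()
    return int.from_bytes(digest[:2], "big")


def toy_hash(fwd: int, bwd: int) -> int:
    """Toy 16-bit target: XOR of two halves that can be computed independently."""
    return _partial(b"F", fwd) ^ _partial(b"B", bwd)


def mitm_search(target: int) -> list[tuple[int, int]]:
    """Return all (fwd, bwd) pairs with toy_hash(fwd, bwd) == target."""
    # Step 3 analogue: enumerate the forward neutral byte and store the results.
    table: dict[int, list[int]] = {}
    for fwd in range(256):
        table.setdefault(_partial(b"F", fwd), []).append(fwd)

    # Steps 4 and 5 analogue: enumerate the backward neutral byte and match
    # each value against the stored table with a single dictionary lookup.
    matches = []
    for bwd in range(256):
        needed = target ^ _partial(b"B", bwd)
        for fwd in table.get(needed, []):
            matches.append((fwd, bwd))
    return matches


def brute_force(target: int) -> list[tuple[int, int]]:
    """Reference search over all 2^16 pairs, used only to check the result."""
    return [(f, b) for f in range(256) for b in range(256)
            if toy_hash(f, b) == target]


if __name__ == "__main__":
    target = toy_hash(0x3C, 0xA5)
    print(sorted(mitm_search(target)) == sorted(brute_force(target)))
```

In the actual attack the two directions are partial computations through the permutation rounds rather than independent hash calls, and finding neutral bytes that make such a split possible is the hard part of the analysis.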
For collision resistance, we consider the rebound attack, the most efficient technique for AES-based hash functions. In particular, the most powerful rebound attacks are based on the Super-Sbox technique [GP10, LMR+09]. With this technique, the attacker can control the difference transitions over two consecutive AES rounds with a pre-computation phase called the inbound phase, as shown in Fig. 8. Combined with the features of the rebound attack, this technique allows the attacker to ignore the influence of 4+16+16+4 = 38 active S-boxes by using 128 free bits. Since the size of one message block in our VIL hash function is 256 bits, we expect that the attacker can ignore 38 × 2 = 76 active S-boxes with the Super-Sbox technique. However, we emphasize that this does not necessarily imply that the attacker can always ignore 76 active S-boxes in an actual attack, because the rebound attack is also a start-from-the-middle-style attack, and one should be careful about the consistency of the CV.
According to Table 7, the minimal number of active S-boxes in 11-round Areion-512 is 119. By ignoring 76 active S-boxes, there are still 119 − 76 = 43 active S-boxes left. In the outbound phase, we usually need to cancel the truncated differences. In the best case, we only need to consider half of the remaining active S-boxes, i.e., we know the propagation of the truncated differences, and we only add conditions on the sum of the two truncated differences, as shown in Fig. 9. Even if we only consider 43/2 ≈ 21 active S-boxes, they still correspond to a very low uncontrolled probability of $2^{-21 \times 8} = 2^{-168}$. Note that we have not yet taken into account the extra conditions on the truncated input and output differences needed to generate a collision. If they are considered, the truncated differential may be worse (i.e., there are more active S-boxes), and the uncontrolled probability may further decrease. Hence, we believe the VIL hash function based on the 15-round Areion-512 is secure against the collision attack. We also note that there is a variant [JNP12] of the 2-round Super-Sbox technique that can cover three consecutive AES rounds, which allows the attacker to ignore the influence of 4+16+16+16+4 = 54 active S-boxes. However, this technique does not come for free. Specifically, unlike the 2-round Super-Sbox technique, which satisfies 4 + 16 + 16 + 4 = 38 active S-boxes and leaves many degrees of freedom afterwards, the 3-round Super-Sbox technique leaves no degrees of freedom, and finding a solution that satisfies these 54 active S-boxes succeeds with probability $2^{-64}$. In other words, it is like the 2-round Super-Sbox technique with 16 extra active S-boxes satisfied with a probability of only $2^{-64}$, which is a huge improvement over the 2-round Super-Sbox technique.
We also note that it is almost equivalent to our conservative estimation that we only need to consider half of the remaining active S-boxes at the outbound phase when using the 2-round Super-Sbox technique for the inbound phase.
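The estimate above is a short chain of subtractions and a halving. The sketch below reproduces it with hypothetical names, taking as inputs the figures quoted in the text (119 active S-boxes from Table 7, 38 S-boxes ignored per 128 free bits, a 256-bit message block, and a cost of 8 bits per remaining S-box). It is a bookkeeping aid only, not an analysis tool.

```python
def rebound_estimate(min_active_sboxes: int,
                     ignored_per_128_free_bits: int = 38,
                     block_bits: int = 256,
                     bits_per_sbox: int = 8) -> dict:
    """Reproduce the outbound-phase estimate from the text (illustrative only)."""
    # A 256-bit message block gives two 128-bit chunks of freedom: 38 * 2 = 76.
    ignored = ignored_per_128_free_bits * (block_bits // 128)
    # Active S-boxes that the inbound phase cannot absorb: 119 - 76 = 43.
    remaining = min_active_sboxes - ignored
    # Best case for the attacker: only half of them cost probability (43/2 -> 21).
    considered = remaining // 2
    return {
        "ignored": ignored,
        "remaining": remaining,
        "considered": considered,
        "log2_probability": -considered * bits_per_sbox,
    }


if __name__ == "__main__":
    print(rebound_estimate(119))
```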

Performance Evaluation
In this section, we evaluate the performance of both Areion and its applications to the permutation-based hash functions described in Sect. 4. To this end, we used the source code available at GitHub 4 to measure the cycle counts, expressed in cycles per byte (cpb), of each target primitive. All our evaluations were performed on the following widely deployed platforms: the Ice Lake, Tiger Lake, and Alder Lake platforms. More precisely, the Ice Lake platform has an Intel(R) Core(TM) i7-1068NG7 CPU @ 2.30GHz. The Tiger Lake platform has an Intel(R) Core(TM) i7-1165G7 CPU @ 2.80GHz. The Alder Lake platform has an Intel(R) Core(TM) i9-12900K CPU @ 3.20GHz on a performance-core (P-core) and 2.40GHz on an efficient-core (E-core). Turbo Boost technology was switched off for all our evaluations. We note that the P-core was used for our evaluations on the Alder Lake platform because there is almost no difference in the benchmarks between the P-core and the E-core.
Besides, we also evaluate the performance of NEON implementations of the permutation-based hash functions proposed in Sect. 4 in several mobile environments. To stay within the page limit, the NEON implementations of Areion are shown in Appendix A.3.
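For readers unfamiliar with the cpb metric used throughout this section, the following sketch shows how it is derived. It is not the benchmark code used for the tables: it estimates cycles from wall-clock time under the assumption of a fixed clock (here the 2.30 GHz nominal frequency of the Ice Lake platform above), uses SHA-256 from Python's standard library as a stand-in primitive, and its function names are hypothetical. Interpreter overhead dominates at short inputs, so the printed number illustrates the computation only.

```python
import hashlib
import time


def cpb_from_cycles(total_cycles: float, message_bytes: int) -> float:
    """Cycles per byte: cycles spent on one message divided by its length."""
    return total_cycles / message_bytes


def estimate_cpb(func, message: bytes, cpu_hz: float, repeats: int = 1000) -> float:
    """Rough cpb estimate from wall-clock time, assuming a fixed clock cpu_hz."""
    samples = []
    for _ in range(repeats):
        start = time.perf_counter_ns()
        func(message)
        samples.append(time.perf_counter_ns() - start)
    # Use the median to reduce the influence of scheduling noise.
    samples.sort()
    median_ns = samples[len(samples) // 2]
    cycles = median_ns * 1e-9 * cpu_hz
    return cpb_from_cycles(cycles, len(message))


if __name__ == "__main__":
    msg = bytes(64)  # a short 64-byte input
    print(estimate_cpb(lambda m: hashlib.sha256(m).digest(), msg, cpu_hz=2.3e9))
```

Native benchmarks normally read a hardware cycle counter directly instead of converting from elapsed time, which avoids the fixed-clock assumption.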

Underlying Permutations
We first evaluate the performance of the underlying permutations, i.e., Areion-256 and Areion-512. These implementations are given in Appendix A.1. For comparison, we used the underlying permutations of Simpira v2, Haraka v2, and the 512-bit permutation of BLAKE2s. The source codes of Haraka v2 and BLAKE2s are available at GitHub 5,6 , but we could not find available source code for Simpira v2. For this reason, we implemented it as described in Appendix A.2. According to [GM16, KLMR16], Simpira v2 and Haraka v2 are supposed to operate on multiple message blocks, not just a single message block, to get the highest performance. Based on this concept, we evaluate the performance when operating on eight message blocks in parallel as well as on a single message block. Tables 8 and 9 show benchmarks for single and parallel encryption/decryption on our platforms. From these tables, Haraka v2 appears to offer the fastest encryption, but it cannot be regarded as having a sufficient security margin, as discussed in Sect. 3.2.1. For this reason, we consider it reasonable to exclude the original Haraka v2 from our comparison. Instead, we use the 12/15-round variants of Haraka v2, Haraka-256 (x1.2/x1.5) and Haraka-512 (x1.2/x1.5). This is because the DM-based instantiations of these tweaked variants, Haraka256-DM (x1.2/x1.5) and Haraka512-DM (x1.2/x1.5), can be regarded as having a security level similar to that of Areion256-DM and Areion512-DM. Indeed, the security margins against MITM preimage attacks of Areion256-DM, Haraka256-DM (x1.2), and Haraka256-DM (x1.5) are 5, 3, and 6 rounds, respectively. Similarly, the security margins of Areion512-DM, Haraka512-DM (x1.2), and Haraka512-DM (x1.5) are 5, 1, and 4 rounds, respectively.
We summarize the performance comparison for the underlying permutations as follows:
• Areion-256 realizes the fastest encryption among the target permutations, excluding Haraka-256 (x1.2) for single block encryption (although there are almost no differences in performance). Specifically, Areion-256 performs at least 1.52 and 1.20 times faster than Simpira-256 and Haraka-256 (x1.5) for single block encryption, respectively, and at least 1.12 and 1.03 times faster than Simpira-256 and Haraka-256 (x1.2) for parallel block encryption, respectively. On the other hand, for single and parallel block decryption, Areion-256 performs faster than Haraka-256 (x1.2/x1.5), but there are almost no differences in performance between Areion-256 and Simpira-256.
Given that the Areion-512 decryption function is not used in the proposed applications of Areion described in Sect. 4, we consider it acceptable that Areion-512 performs slower than Simpira-512 for decryption. Therefore, the main advantage of Areion is that it performs faster than the other target permutations, especially in the encryption direction.
Regarding the advantage of Areion-256 over Areion-512, Table 9 suggests that Areion-256 is consistently faster than Areion-512 for parallel processing and is even the fastest among all the selected 256-/512-bit permutations in many cases. In addition, it has a balanced performance for encryption and decryption thanks to its Feistel-like structure, unlike Haraka-256, and is faster than the Feistel-based Simpira-256. That is, it should work more efficiently than other permutations with the existing parallelizable permutation-based authenticated encryption modes, e.g., OPP [GJMN16] and a permutation-based counterpart of OTR [Min14]. The latter would be similar to Prøst-OTR [KLL+14], adopting the masking scheme of OPP for provable security and for avoiding the attack specific to (the masking scheme of) Prøst-OTR [DEM15]. Its applications to parallel authenticated encryption modes are left as future work. On the other hand, Areion-512 is the fastest among the selected 256-/512-bit permutations in the single-block encryption direction (Table 8). That is, it should work more efficiently with the existing permutation-based compression functions such as the DM construction, and the existing sequential hash functions such as the MD construction. These are the target applications for this study.

Permutation-based Hash Functions
Next, we evaluate the performance of the permutation-based hash functions, i.e., the SFIL and VIL hash functions (DM and MD constructions). These instantiations of Areion are implemented based on the source codes of Areion-256 and Areion-512 described in Appendix A.1. For comparison regarding the SFIL hash functions, we used DM constructions instantiated with Simpira v2 and Haraka v2. On the other hand, for comparison regarding the VIL hash functions, we used AES-based VIL hash functions, such as Simpira512-MD, Haraka512-MD, and the double-block-length hash function proposed by Hirose at FSE 2006 [Hir06]. We refer to Hirose's hash function as Hirose-DBL. These instantiations are implemented similarly to those of Areion. In addition, we used SHA2-256, SHA3-256, ParallelHash256, KangarooTwelve, and BLAKE3. Their source codes are available at GitHub 7,8,9 ; we modified them for use in our comparison. Tables 10 and 11 show benchmarks for the SFIL and VIL hash functions on our platforms. From Table 10, Haraka512-DM appears to be the fastest SFIL hash function, but Haraka v2 cannot be regarded as having a sufficient security margin; thus, we use Haraka256-DM (x1.2/x1.5) and Haraka512-DM (x1.2/x1.5) for our comparison regarding the SFIL hash functions, as discussed in Sect. 6.1. Similarly, we use Haraka512-MD (x1.2/x1.5) for our comparison regarding the VIL hash functions.
We summarize the performance comparison for the SFIL hash functions as follows:
• Areion256-DM realizes the fastest SFIL hashing among the target DM constructions with the 256-bit permutation, excluding Haraka256-DM (x1.2) (although there are almost no differences in performance). Specifically, Areion256-DM performs at least 1.41 and 1.21 times faster than Simpira256-DM and Haraka256-DM (x1.5), respectively.
Consequently, Areion256-DM and Areion512-DM can be considered the fastest SFIL hash functions. On the other hand, we summarize the performance comparison for the VIL hash functions as follows:
• Areion512-MD realizes the fastest VIL hashing among the target hash functions with a 256-bit security level for input sizes up to around 4K bytes. Specifically, its performance is less than 3 cpb for any message size. Moreover, it is about 10 times faster than existing state-of-the-art schemes (e.g., SHA2-256, SHA3-256, and ParallelHash256) for short messages up to around 100 bytes, which are the most widely used input sizes in real-world applications.
Considering the need for cryptographic primitives resistant to symmetric-key cryptanalysis based on quantum algorithms (e.g., Grover's algorithm [Gro96]), hash functions with a 256-bit security level will be required in the future. For this reason, we consider it acceptable that Areion512-MD performs slower than KangarooTwelve when the input size is 2K bytes or more. In addition, according to a study on packet sizes on the Internet [MKZ+17], around 44% of packets are between 40 and 100 bytes long and 37% are between 1400 and 1500 bytes in size. Given that most of

Table 11: Benchmarks for VIL hash functions on the Ice Lake, Tiger Lake, and Alder Lake platforms. All values are given as cpb.

Our hash function is surprisingly fast. Its performance is less than 3 cycles/byte on the latest Intel architectures for any message size. It is about 10 times faster than existing schemes for short messages up to around 100 bytes, which are the most widely used input sizes in real-world applications, on both the latest CPU architectures (Ice Lake, Tiger Lake, and Alder Lake) and mobile environments (Pixel 6 and iPhone 13).