Composite Enclaves: Towards Disaggregated Trusted Execution

The ever-rising computation demand is forcing the move from the CPU to heterogeneous specialized hardware, which is readily available across modern data-centers through disaggregated infrastructure. On the other hand, trusted execution environments (TEEs), one of the most promising recent developments in hardware security, can only protect code confined in the CPU, limiting TEEs' potential and applicability to a handful of applications. We observe that the TEEs' hardware trusted computing base (TCB) is fixed at design time, which in practice leads to using untrusted software to employ peripherals in TEEs. Based on this observation, we propose composite enclaves with a configurable hardware and software TCB, allowing enclaves access to multiple computing and IO resources. Finally, we present two case studies of composite enclaves: i) an FPGA platform based on RISC-V Keystone connected to emulated peripherals and sensors, and ii) a large-scale accelerator. These case studies showcase a flexible but small TCB (2.5 KLoC for IO peripherals and drivers), with a low performance overhead (only around 220 additional cycles for a context switch), thus demonstrating the feasibility of our approach and showing that it can work with a wide range of specialized hardware.


Introduction
For most of the computer's history, designing an architecture around the CPU allowed extracting the most performance benefits from Moore's law. Nowadays, however, the demand for increased computation power is usually met with special-purpose hardware: GPUs are often orders of magnitude more efficient than a CPU for parallel workloads such as graphics and machine learning, and FPGAs often achieve similar gains for custom workloads. Some tasks such as machine learning are even pervasive enough to justify the investment into fully custom ASICs [JYP + 17]. In these modern platform architectures, the CPU's main job is to move data to the relevant specialized hardware [JBC + 15], collect the results, and then possibly feed them to yet another device. Effectively, the CPU's primary role is shifting towards a mere coordinator of available specialized hardware. Cloud computing architectures are even adopting a disaggregated model called composable disaggregated infrastructure (CDI) [KSP + 16, LCM + 09, NTT + 18] in which data centers no longer consist of a number of connected servers, but of functional blocks connected with high-speed interconnects. Each block provides a pool of a particular resource, be it GPUs, CPUs, memory, storage, or FPGAs, to allow for fine-grained resource allocation.
1. We extend traditional TEEs with a configurable hardware TCB, i.e., the enclave's TCB only includes the driver and firmware of the specialized hardware it uses. We call these new enclaves composite enclaves. We identify two new properties that are relevant for these systems: a more comprehensive attestation for composite enclaves, and platform awareness. Additionally, we propose a software design that abstracts the underlying hardware layer to ease the integration with the existing application and driver ecosystem.
2. We analyze the security aspects of our approach in detail. This includes the security implications of our design decisions and a number of relevant side-channels.
3. We demonstrate two case studies: first, we present an end-to-end prototype based on Keystone [LKS + 20] on an FPGA running a RISC-V processor [ZB19] connected to multiple external peripherals (IO devices, sensors, etc.) emulated by an Arduino microcontroller. Our modifications to the software TCB of Keystone only amount to around 600 LoC. Second, we perform a case study based on a GPU-style accelerator [ZSB21] and integrate it within a composite enclave while also supporting multi-tenant isolation.

Keystone
Keystone [LKS + 20] is a TEE framework based on RISC-V similar to existing TEE designs such as Intel SGX [CD16] and Sanctum [CLD16]. However, in contrast to these systems, which leverage the MMU to isolate memory, Keystone isolates physical memory using physical memory protection (PMP). PMP is specified in the RISC-V privilege standard [WLA + 19], and its entries configure access policies that can individually allow or deny reading, writing, and executing for a memory range. For instance, a PMP entry can be used to restrict the operating system (OS) from accessing the memory of the bootloader. Every access request to a prohibited range gets trapped precisely in the core and results in a hardware exception. In Keystone, the PMP entries are managed by the security monitor (SM), which runs in the highest privileged mode called m-mode. The untrusted OS runs in the supervisor mode (s-mode), whereas ordinary applications run in the least privileged user mode (u-mode). Isolated enclaves run in their own separate s- and u-mode in parallel to the OS. The SM maintains its own memory, which is separate from the OS and protected by a PMP entry. It facilitates all enclave calls, e.g., it creates, runs, and destroys enclaves. The SM configures the PMP entries so that the OS can no longer access the enclave's private memory. Upon a context switch, the SM re-configures the PMP to allow or block access to the enclave. For instance, during a context switch from an enclave to the OS, the SM changes the PMP configuration such that access to the enclave memory is prohibited. Conversely, on a context switch back to the enclave, the PMP gets reconfigured to allow accesses to enclave memory. Since the SM is critical for the security of any enclave and the whole system, it aims to be minimal and lean. As such, the SM is orders of magnitude smaller than hypervisors and operating systems (15 KLoC vs. millions of LoC [T + 21, BDF + 03]).
There are also efforts to create formal proofs for such an SM [LHD + 19]. Keystone also provides extensions for cache side-channel protections using page coloring or dynamic enclave memory.
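The PMP reconfiguration on a context switch can be illustrated with a small Python model. This is a sketch under simplifying assumptions, not Keystone's implementation: the names PmpEntry, SecurityMonitor, switch_to, and check are hypothetical, entries are modeled as plain address ranges rather than the actual PMP configuration and address registers, and matching is reduced to "the first matching entry decides, no match means deny".

```python
from dataclasses import dataclass


@dataclass
class PmpEntry:
    base: int
    size: int
    perms: str  # subset of "rwx"; the empty string denies everything

    def matches(self, addr: int) -> bool:
        return self.base <= addr < self.base + self.size


class SecurityMonitor:
    """Toy model of an SM that rebuilds the PMP entries on each context switch."""

    def __init__(self, sm_region, dram_region):
        self.sm_region = sm_region      # (base, size) of the SM's own memory
        self.dram_region = dram_region  # (base, size) of all DRAM
        self.enclaves = {}              # enclave id -> (base, size)
        self.active = []                # PMP entries currently in force

    def create_enclave(self, eid, base, size):
        self.enclaves[eid] = (base, size)

    def switch_to(self, context):
        """Reconfigure PMP for the context 'os' or for an enclave id."""
        # The SM's memory is never accessible from s-mode or u-mode.
        entries = [PmpEntry(*self.sm_region, perms="")]
        for eid, (base, size) in self.enclaves.items():
            # Only the enclave being entered may touch its private memory.
            perms = "rwx" if eid == context else ""
            entries.append(PmpEntry(base, size, perms))
        if context == "os":
            # Lowest-priority entry: the OS may use the remaining DRAM.
            entries.append(PmpEntry(*self.dram_region, perms="rwx"))
        self.active = entries

    def check(self, addr, access):
        """Return True if the access ('r', 'w' or 'x') would be permitted."""
        for entry in self.active:
            if entry.matches(addr):
                return access in entry.perms
        return False  # no matching entry: the access faults
```

In this model, the deny entries are placed before the allow-all entry, so an OS access to enclave or SM memory is rejected even though the final entry covers all of DRAM.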

Device Tree
The device tree is a data structure that describes the physical memory mappings of a platform. It describes the central processor, e.g., its speed, its ISA, and at what address its cache starts. It also includes the DRAM base address and other components on the die, such as internal and external buses. It is usually used by the bootloader and the OS to bootstrap the system. As some peripherals cannot be detected automatically, they must be present in the device tree; otherwise, the OS will not recognize them. The device tree is usually burnt into ROM and available to the bootloader and the OS. It can therefore be considered trusted.

Problem Statement
Modern platforms are composed of (disaggregated) heterogeneous devices, from simple sensors that measure temperature or humidity to complex accelerators for machine learning. We summarize all of these devices under the term specialized hardware in this paper. Many modern workloads are critically dependent on such specialized hardware and often handle sensitive data, e.g., patient records for machine learning. However, existing solutions have severe limitations for such applications.

Existing Solutions
There are several existing solutions for applications that handle sensitive data while also leveraging specialized hardware. For example, a fully dedicated system could, of course, support such an application, but it would incur high costs and offer very poor flexibility. On the other hand, the application could be executed on an ordinary operating system or even in a virtual machine. However, both of these approaches rely on substantial codebases with millions of lines of code.

Alternative Approaches
In this paper, we investigate an approach based on TEEs. However, approaches based on microkernels, such as seL4 [KEH + 09], are also promising. We want to stress that both of these options would require significant changes, and both are bound to encounter challenges along the way. Microkernels have the advantage of already supporting applications that leverage specialized hardware, but, in turn, they do not support attestation. TEEs, on the other hand, support many desired properties out of the box but lack integration with specialized hardware. The TCB of both approaches would probably be comparable with only a slight difference because microkernels include the scheduler in their TCB. Nevertheless, we believe both directions to be promising, but we focus on TEEs in this work. Further discussion on a potential approach based on a microkernel can be found in Section 10.

Attacker Model
The attacker model is tightly coupled with the type of specialized hardware. We separate the specialized hardware into two classes due to their distinct effect on the attacker model:
Specialized hardware with physical interaction: Such devices range from input-only devices, such as input peripherals (e.g., mouse, keyboard) and sensors (e.g., temperature sensor), to output-only devices (e.g., monitor) and combined IO devices (e.g., touchscreen). For any such device, a local physical adversary can manipulate the environment and thus the input (and potentially the output). For example, a physical adversary can point a laser at a light sensor, thus changing the sensor's reading but not the room's overall light intensity. Hence, any specialized hardware that interacts with its physical environment cannot tolerate a physical adversary.
Specialized hardware without physical interaction: There are specialized hardware units that do not explicitly interact with their environment. They draw power and produce heat, but their input and output are not related to the environment. GPUs and other accelerators are the prime examples of this class of specialized hardware, for which a local physical adversary can be tolerated.
In this paper, we assume a remote attacker who controls the entire software stack, i.e., the OS and hypervisor. While the remote attacker model is a weaker assumption than the local physical attacker considered by existing TEEs, it covers both aforementioned classes of specialized hardware. In the remote model, the attacker cannot access the platform physically or hot-swap a specialized hardware device. Note that the untrusted OS is still in charge of managing specialized hardware, and thus is able to remap the devices or send a reset or power-off signal. In addition, an adversary may launch DMA attacks using rogue peripherals. We assume that the CPU firmware is trusted. Similar to other TEE proposals, side-channel attacks are out of scope [CD16]. However, we will discuss the implications of our proposal on existing side-channel attacks and defenses in Section 6. Finally, we consider denial-of-service attacks to be out of scope.

Security Goals
G1: Enclave protection The enclave's private data must remain confidential and integrity protected at all times. This includes protection from malicious enclaves, DMA attacks, and rogue specialized hardware.

Figure 1: Two composite enclaves are highlighted by blue and yellow outlines. Both consist of two unit enclaves: Encl1 and the keyboard that is connected over the memory-mapped SPI bus, and Encl2 and a GPU connected over PCI through DMA.
G2: Secure integration with specialized hardware It must be possible to integrate specialized hardware into an enclave, and the communication between them must remain confidential and integrity protected in all circumstances.

G3: Attestation Attestation of an enclave should not only cover its code and the genuineness of the processor but also the involved specialized hardware.

Overview
We propose a heterogeneous TEE architecture with a configurable hardware and software TCB. The enclaves that run on top of our design are called composite enclaves. As their name suggests, composite enclaves combine multiple components, such as a normal enclave on the CPU and a specialized hardware device such as an accelerator. For simplicity, we call all these individual components unit enclaves. In the following, we highlight how composite enclaves are constructed, starting with how individual unit enclaves communicate, what happens on a failure, and finally, how composite enclaves are attested.
In modern platforms, the processor communicates with specialized hardware devices using two mechanisms: memory-mapped IO (MMIO) or direct memory access (DMA). In our design, unit enclaves communicate over shared memory. We leverage existing memory protection mechanisms, such as PMP [WLA + 19] or TZASC [ARM14], which already allow protecting any memory region, including MMIO and DMA regions. However, this implies sharing memory between enclaves, potentially endangering confidential data. We therefore propose an architecture where every enclave has its own private memory and separate shared memory regions, as depicted in Figures 1 and 2.
However, any of these communicating unit enclaves may encounter failures or other complications at any time, e.g., the unit enclave on the processor might get killed or destroyed without the keyboard noticing. In all of these edge cases, our proposal ensures that no confidential data is leaked (G1 and G2). We achieve this by decomposing all possible situations into two new enclave life cycle events: connect and disconnect. Intuitively, we provide a way to handle disconnects asynchronously by moving any shared memory region to the sole ownership of the surviving unit enclave. Follow-up synchronous disconnect and connect events may be employed to reestablish new shared memory regions and continue execution.
As mentioned before, our design must support an improved attestation mechanism that includes specialized hardware devices and the communication set up between the devices and the unit enclave on the processor (G3). To provide such an attestation mechanism, we propose a system where the verifier attests to all unit enclaves individually, receiving unique identifiers of connected unit enclaves, and then chains the reports together. However, chaining attestation reports could be vulnerable to manipulations that occur between two such attestations. We describe a mechanism that ensures safe attestation of composite enclaves in the presence of such manipulation attacks (cf. Section 6.3).

Figure 2: Example of private and shared memory regions with two enclaves and a peripheral. Note that the shared memory region between the peripheral and Encl 2 can either be MMIO registers, and thus not backed by actual DRAM, or a DMA region.

Composite Enclaves
In this section, we describe composite enclaves in detail. Composite enclaves combine unit enclaves on the processor and on specialized hardware devices. First, we discuss the different types of unit enclaves and the necessary changes to specialized hardware to make them compatible. Then we introduce a shared memory model that allows unit enclaves to communicate with each other and specialized hardware securely. Next, we discuss how the enclave life cycle changes given these modifications and how a remote verifier can attest to a composite enclave. Finally, we provide a software design that eases adoption for software developers.

Unit Enclaves within a Composite Enclave
A composite enclave consists of multiple unit enclaves that run on different hardware components and securely communicate with each other. A composite enclave may span several unit enclaves on the CPU and on specialized hardware. In the following, we describe the two main unit enclave types.
Unit enclaves on the CPU. Unit enclaves on the CPU are similar to traditional enclaves, e.g., their runtime memory must be isolated from the OS and should only be accessible to the unit enclave itself. To achieve that, we use physical memory protection (PMP) from the RISC-V privilege standard [WLA + 19] as introduced by Keystone. We further differentiate two types of unit enclaves on the CPU in our software design (Section 5.6): application enclaves and driver enclaves, which encapsulate the application and driver logic, respectively.
Unit enclaves on specialized hardware. Most specialized hardware runs some firmware or even some custom code (e.g., graphic shaders), which must be included in the TCB of a composite enclave. For example, the GPU and its firmware in Figure 1 are part of the yellow composite enclave. Some specialized hardware may only be usable for a single tenant at a time, whereas others may support multi-tenancy for multiple unit enclaves running simultaneously. Since a remote verifier also wants to attest to specialized hardware devices, they must be modified to support attestation. However, we stress that these modifications remain rather small (cf. Section 7.2) and are discussed in several upcoming device attestation standards by the industry [DMT20, JPF17].

Changes to Specialized Hardware
A wide range of specialized hardware devices have unique behavior and integrate differently into composite enclaves. In this paper, we try to cover most devices but stress that some special cases require further analysis. Our discussion ranges from the simplest specialized hardware device we can imagine, a simple sensor, to one of the most complex, a sophisticated accelerator for a data center. Most other devices should fall in between these two examples and thus require modifications between these two extremes.
Simple sensors. A temperature sensor or another simple sensor only requires a minimal form of attestation to be integrated into composite enclaves. Specifically, it must contain some key material to sign statements about itself. This is mandatory for (remote) attestation of a composite enclave that includes an attestation report of such a sensor. We note that upcoming standards by the industry [DMT20, Int18, JPF17] already propose such attestation mechanisms for various specialized hardware ranging from simple sensors to accelerators. Any simple sensor that already supports such an attestation standard can be integrated into composite enclaves without any hardware changes.
Accelerators. On the other hand, accelerators tend to be very complex and may require more extensive modifications. Similar to simple sensors, they must support attestation (e.g., PCIe attestation [Int18]), but they may also require some form of multi-tenancy. Consider data-center applications, where multiple stakeholders want to move multiple compute-intensive tasks from the CPU to an accelerator. The individual tasks' data should remain confidential and isolated, not only on the CPU but also on the accelerator. Thus, such an accelerator requires multiple isolated and attestable domains, in other words, unit enclaves that run on the specialized hardware.

Communication with Specialized Hardware
To enable unit enclaves on the CPU and specialized hardware to communicate securely, we make the observation that these devices generally communicate over mapped address regions: They either use an address range that is not reflected in DRAM, so-called memory-mapped input-output (MMIO) registers, or a shared DRAM region accessed via direct memory access (DMA). To maximize compatibility with existing drivers and specialized hardware, we chose not to change this behavior. Instead, we isolate the address regions that are used in this communication. Existing memory protection mechanisms like PMP already allow restricting access to a specific memory address region. They also allow restricting access to other address regions that are not in the DRAM range. Therefore, our proposal does not require any changes to the processor, as mechanisms such as PMP are already part of many standards [WLA + 19, ARM14]. Note that address regions used by specialized hardware are either i) static, i.e., hardcoded, in the form of a trusted device tree file, or ii) dynamic, i.e., configured at runtime by the SM. In our design, the SM always maintains a complete overview of all such regions and only allows a single unit enclave on the CPU to access an address region of a specialized hardware device.
While we made the changes mentioned above to the SM to support specialized hardware with both MMIO and DMA, they also enable an alternative way for enclaves to communicate: shared memory. This is a major difference from traditional TEEs because most traditional enclaves can only communicate through the untrusted OS.
Polling and interrupts. Specialized hardware is synchronized with the processor through either polling or interrupts. Polling requires the CPU to check at a predetermined rate whether new data is available from the specialized hardware, and thus, it is fully compatible with composite enclaves. Interrupts, on the other hand, enable the specialized hardware to notify the CPU that new data is available, using the processor's hardware support. Typically, the operating system registers interrupt handlers, which get called when an interrupt occurs. In RISC-V, interrupts can be delegated from the highest privilege mode to lower ones by using either the mret instruction to forward individual interrupts or the mideleg register for all interrupts of a specific type [WLA + 19]. So, in our design, the SM delegates relevant interrupts to the interrupt handler of a unit enclave instead of the OS. Note that our prototype currently does not implement interrupt-based synchronization, and hence, we only evaluate polling-based synchronization.

Enclave Life Cycle
The untrusted OS manages specialized hardware devices; hence, the OS could remap any device or send a reset signal. For example, a GPU handling sensitive data could be shut down by the OS and remapped to a different GPU during runtime. In such a scenario, the composite enclave should stop sending sensitive data to the GPU until the remote verifier re-attests the new GPU and its unit enclave.
A traditional enclave's life cycle includes three distinct states: idle, running, and paused. The enclave is first created and starts in the idle state. Then the enclave transitions to the running state after a call from a user. Due to a timer interrupt by the OS scheduler, it is paused. It is resumed as soon as the scheduler yields back to the enclave.
Attaching specialized hardware. Before going into the life cycle details, it is crucial to understand how specialized hardware is attached to the platform and initialized. There are two types of initialization procedures: statically compiled in the device tree or dynamically mapped by a bus controller. The device tree describes the specific address ranges and model numbers of all statically connected specialized hardware devices. It is usually stored in on-chip ROM and is provided to the OS by a zero-stage boot-loader, and thus, it can be considered trusted. Dynamically mapped devices are mapped by a bus controller and a driver to a DMA region. In our proposal, the bus controller's driver, which sets up the DMA region, has to be trusted (but it could reside in its own unit enclave).
Changes during runtime. In all unit enclaves, we introduce two additional life cycle events to describe what happens when a shared memory region is altered. These events, connect and disconnect, are needed due to the asynchronous nature of specialized hardware, as a disconnect event could happen at any time.
Asynchronous disconnects are critical, as a composite enclave could end up continuing to use a memory region that is no longer protected due to a disconnect. Additionally, composite enclaves might want to provide graceful degradation and should not crash completely upon a disconnect. We solve both issues by splitting the disconnect event into an asynchronous disconnect and a synchronous disconnect. We consider the two entities attached to a shared memory region (unit enclaves or specialized hardware) to have shared ownership over that region. If one of the entities dies, the other entity gains sole ownership of the memory region. As such, an asynchronous disconnect turns a previously shared memory region into a solely owned one. In turn, the untrusted OS can issue a synchronous disconnect command to the SM to free the shared memory region and notify the composite enclave and all its unit enclaves of the disconnect. We mandate that before any connect command, the unit enclave must first receive a synchronous disconnect. If this were not the case, an adversary could disconnect a benign specialized hardware device and reconnect a malicious one without the enclave noticing.
We illustrate the behavior of composite enclaves using an example scenario. Unit enclave 1 ($E_1$) is connected to unit enclave 2 ($E_2$), which, in turn, is connected to a specialized hardware device (HW). We denote the shared memory regions as $S_{\{E_1,E_2\}}$ and $S_{\{E_2,HW\}}$, which are shared between $E_1$ and $E_2$, and between $E_2$ and HW, respectively.
1. $E_1$ is killed. In such a situation, the specific shared memory region $S_{\{E_1,E_2\}}$ should be destroyed. To do that, the SM performs an asynchronous disconnect of $E_1$ for $S_{\{E_1,E_2\}}$, resulting in sole ownership of $S_{\{E_1,E_2\}}$ by $E_2$. Upon the following synchronous disconnect, $S_{\{E_1,E_2\}}$ gets fully destroyed. An application may require any sensitive data from $E_1$ that still remains on HW to be cleared. In such a scenario, $E_2$ will tell HW to clear this data on the following synchronous disconnect.
2. $E_2$ is killed. All shared memory regions associated with $E_2$ (this includes the shared memory regions with both $E_1$ and HW) are immediately modified by the SM during the asynchronous disconnect. They are now solely owned by $E_1$ and HW, respectively. Zeroing out $S_{\{E_2,HW\}}$ also implicitly notifies HW that $E_2$ has died, forcing the specialized hardware to reset.

3. HW is killed/disconnected. In the asynchronous disconnect, the SM immediately modifies $S_{\{E_2,HW\}}$ to $S_{\{E_2\}}$. At some later point, the OS must issue a synchronous disconnect, which invalidates $S_{\{E_2\}}$. This also results in the destruction of $S_{\{E_1,E_2\}}$ in case $E_1$ accesses HW through $E_2$. From then on, $E_2$ is available to connect to a new HW (after attestation).
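The ownership rules in the three scenarios above can be sketched as a small Python model. It is illustrative only: the class and method names are hypothetical, regions are identified by plain strings, and the rule "a synchronous disconnect must precede any connect" is simplified to "an entity that still solely owns an orphaned region cannot take part in a new connect". Cascading destruction of dependent regions and clearing of device state are not modeled.

```python
class LifecycleError(Exception):
    """Raised when a life cycle event violates the ordering rules."""


class SecurityMonitor:
    def __init__(self):
        self.regions = {}  # region name -> set of current owners

    def connect(self, name, a, b):
        """Create a shared region between entities a and b."""
        for owners in self.regions.values():
            # A solely owned region marks a pending asynchronous disconnect.
            if len(owners) == 1 and owners & {a, b}:
                raise LifecycleError(
                    "synchronous disconnect required before connect")
        self.regions[name] = {a, b}

    def async_disconnect(self, dead):
        """An entity died: its peers become sole owners of shared regions."""
        for owners in self.regions.values():
            owners.discard(dead)

    def sync_disconnect(self, name):
        """Issued by the OS: free the region and return the owners to notify."""
        return self.regions.pop(name)
```

For scenario 3, calling async_disconnect("HW") leaves the region owned by E2 alone, a subsequent connect of E2 to a replacement device is rejected, and it succeeds only after sync_disconnect has freed the orphaned region.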

Attestation of a Composite Enclave
We extend the existing notion of attestation from traditional enclaves to composite enclaves that run on multiple specialized hardware devices within the platform. Traditionally, attestation establishes the current state of an enclave through a measurement of the code. The standard attestation report of a traditional enclave contains the measurements of the enclave and the low-level firmware (e.g., the security monitor in RISC-V Keystone or µCode in SGX). Both are signed by the platform key (known as the device root key). In contrast, the attestation of a composite enclave must also reflect all included unit enclaves and corresponding specialized hardware devices. A potential attestation mechanism for a composite enclave could be a lengthy report containing all the components' measurements, including the specialized hardware (similar to related device attestation standards [DMT20, Int18, JPF17]). Instead, we let the verifier decide which other unit enclaves to attest. When the verifier attests a specific unit enclave, a list of identifiers of all connected unit enclaves is provided alongside the attestation report. These identifiers are assigned by the SM and can be used to specify which unit enclave one wants to attest. A verifier can then choose to attest some or all of the connected unit enclaves from the list of identifiers.
Unit enclave identifiers. Upon creation of a new unit enclave, the SM assigns a unique identifier to it. This identifier uniquely determines the unit enclaves participating in a specific shared memory region. When the unit enclave is killed, the identifier may be reused for other unit enclaves (cf. Section 6).
Figure 3 depicts an example composite enclave and the sequence of the attestations between its different unit enclaves. The example composite enclave contains three unit enclaves: enclave 1 ($E_1$), enclave 2 ($E_2$), and the firmware of a specialized hardware device.
Note that the attestation process starts from the verifier, who initiates a remote attestation request of $E_2$. The attestation report of $E_2$ includes a list of connected unit enclaves' identifiers, notably $E_1$. The verifier then executes a series of individual remote attestations to all connected unit enclaves. Note that the individual attestations of $E_1$ and $E_2$ include each other's identifiers in their lists of connected components. Also, the attestation reports of $E_1$ and $E_2$ are both signed by the same platform key. This proves to the remote verifier that both unit enclaves are running on the same platform.

Attestation flow
For specialized hardware, the attestation mechanism is different. First of all, a specialized hardware device needs to contain some key material and a signed certificate from the manufacturer. This allows a verifier to check the legitimacy of the device. Secondly, the verifier from Figure 3 needs to be able to verify that the specialized hardware is directly talking to $E_1$. This is facilitated by the SM, which checks the address regions for MMIO registers. DMA regions can even be established by an untrusted entity such as the OS. However, the attestation reports of both the specialized hardware and $E_1$ contain the physical memory region that they share.
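The chained attestation of unit enclaves on the CPU can be sketched from the verifier's side as follows. This is a minimal sketch with several assumptions that are not part of the design described here: an HMAC under a shared demo key stands in for the platform's signature (a real deployment would use asymmetric signatures and certificates), the report fields and function names are hypothetical, and attestation of specialized hardware with a manufacturer certificate is left out.

```python
import hashlib
import hmac
import json

PLATFORM_KEY = b"demo-platform-key"  # stand-in for the device root key


def sign_report(body, key=PLATFORM_KEY):
    """Platform side: authenticate a report body (HMAC replaces a signature)."""
    payload = json.dumps(body, sort_keys=True).encode()
    mac = hmac.new(key, payload, hashlib.sha256).hexdigest()
    return {"body": body, "mac": mac}


def verify_report(report, key=PLATFORM_KEY):
    """Verifier side: check that the report was produced with the platform key."""
    payload = json.dumps(report["body"], sort_keys=True).encode()
    expected = hmac.new(key, payload, hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, report["mac"])


def attest_composite(start_id, request_report, nonce):
    """Attest start_id, then every unit enclave reachable via identifier lists."""
    reports, pending = {}, [start_id]
    while pending:
        uid = pending.pop()
        if uid in reports:
            continue
        report = request_report(uid, nonce)
        body = report["body"]
        if not verify_report(report) or body["nonce"] != nonce or body["id"] != uid:
            raise ValueError(f"attestation of {uid} failed")
        reports[uid] = body
        pending.extend(body["connected"])
    # Every connection must be reported by both of its endpoints.
    for uid, body in reports.items():
        for peer in body["connected"]:
            if uid not in reports[peer]["connected"]:
                raise ValueError(f"{peer} does not list {uid}")
    return reports
```

Here request_report is a caller-supplied function that returns the signed report for a given identifier. Binding every report to the same fresh nonce is one simple way to tie the individual attestations to a single verification run; it does not by itself address manipulations between two attestations.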

Software Design
In this section, we introduce the software design of composite enclaves, which is one possible way for application, driver, and firmware developers to adapt their software to composite enclaves with minimal effort.

Software components
The software design of composite enclaves consists of three entities: application enclaves, driver enclaves, and firmware on specialized hardware devices, as shown in Figure 4. Application enclaves and driver enclaves are unit enclaves on the CPU. Specialized hardware is connected to the platform over buses. Contrary to a monolithic design where the application and driver are fused into one big enclave, our modular approach aims to provide high flexibility and increase code reuse.
Application enclaves. Application enclaves are similar to the traditional enclaves in Intel SGX or Keystone. In such TEEs, the enclaves cannot access specialized hardware without using the OS as a mediator, as the OS handles all drivers. In our design, application enclaves also cannot communicate with specialized hardware directly. Instead, they use shared memory to communicate with a driver enclave that then communicates with the specialized hardware device. The rationale for separating the driver from the application logic is two-fold: i) developers do not have to ship driver code with their application, and ii) one driver enclave per specialized hardware device allows multiple application enclaves to communicate with that device in parallel.

Driver enclaves
The driver enclave contains the driver that facilitates communication with a specialized hardware device, and it may mediate any access to the device (e.g., rate-limiting). Note that application enclaves, standard non-enclave applications, and the OS can no longer access the specialized hardware device directly. The only way to communicate with the device is through the device-specific driver enclave. Such a design choice isolates the drivers: one compromised driver does not affect other composite enclaves. The driver enclave maintains an isolated communication channel over shared memory to application enclaves and the specialized hardware device. To simplify the configuration, we assume that only one active driver enclave per specialized hardware exists at a time. However, any driver enclave can be replaced at the user's request.

Isolation of multi-application enclave session
Multiple application enclaves could connect to a single driver enclave to have simultaneous access to a specialized hardware device. In such a scenario, the driver enclave keeps separate states corresponding to each of the application enclaves. Note that this is first a functional and only then a security requirement, as operations in one application enclave could affect the computation of another application enclave if there is no isolation. For some devices, the driver enclave may need to reset the state of the specialized hardware when it switches to a session with a different application enclave (temporal isolation). However, sophisticated accelerators may support multiple isolated workloads in parallel (spatial isolation), and thus the state does not have to be reset.
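The temporal variant of session isolation can be sketched as follows. The sketch assumes a hypothetical single-tenant device whose entire state fits in a dictionary; FakeDevice, DriverEnclave, and their methods are illustrative names, and saving and restoring state through the driver enclave is one possible design, not necessarily the one used in the prototype.

```python
class FakeDevice:
    """Hypothetical single-tenant device with resettable internal state."""

    def __init__(self):
        self.state = {}
        self.resets = 0

    def reset(self):
        self.state = {}
        self.resets += 1


class DriverEnclave:
    """Keeps one saved state per application enclave and resets between them."""

    def __init__(self, device):
        self.device = device
        self.sessions = {}   # application enclave id -> saved session state
        self.current = None  # application enclave currently using the device

    def _switch(self, app_id):
        if self.current == app_id:
            return
        if self.current is not None:
            # Save the outgoing session inside the driver enclave.
            self.sessions[self.current] = dict(self.device.state)
        self.device.reset()  # nothing of the old session stays on the device
        self.device.state = dict(self.sessions.get(app_id, {}))
        self.current = app_id

    def write(self, app_id, key, value):
        self._switch(app_id)
        self.device.state[key] = value

    def read(self, app_id, key):
        self._switch(app_id)
        return self.device.state.get(key)
```

Each switch between application enclaves costs one device reset in this model, which is the overhead that spatially isolating accelerators avoid.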

Security Analysis
In this section, we informally analyze the security of composite enclaves. First, we show how isolation from a malicious OS (G1) and malicious specialized hardware (G2) is achieved including a number of relevant side-channel attacks. Then we analyze the life cycle events of unit enclaves and discuss the security of the attestation of composite enclaves (G3).

Isolation
Malicious OS We leverage PMP entries [WLA + 19] to protect address regions that are used by unit enclaves. Recall that in stock Keystone [LKS + 20], the PMP configuration only allows each enclave to access its private memory (G1). On top of this, we use additional PMP entries to protect shared memory regions (G2). Note that only the highest privilege level, i.e., the SM, can modify PMP entries. During a context switch, the SM re-configures all PMP entries such that the correct memory ranges are available again. The SM has a complete overview of all unit enclaves and shared memory regions and sets up all PMP entries on its own. The processor will throw an access fault exception upon any memory access into protected memory regions. The hardware page table walker also must behave according to the configured PMP rules. Therefore, misconfigured page tables cannot be used to leak any data from protected memory ranges.
The SM enforces a shared memory region to be strictly shared between two entities (e.g., a unit enclave on the CPU and a specialized hardware device). The SM also verifies that no overlap exists between the memory ranges similar to stock Keystone.
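The two checks the SM performs on a shared memory region (exactly two owners, no overlap with existing regions) can be sketched as follows. This is an illustrative model only, not the SM code: the `Region` type and the `can_register_shared` helper are hypothetical, and PMP-specific constraints such as alignment or granularity are deliberately ignored.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Region:
    """Hypothetical description of a protected memory range."""
    base: int
    size: int
    owners: tuple  # identifiers of the entities allowed to access the range

    @property
    def end(self) -> int:
        return self.base + self.size


def overlaps(a: Region, b: Region) -> bool:
    # Half-open intervals [base, end) overlap iff each starts before the other ends.
    return a.base < b.end and b.base < a.end


def can_register_shared(new: Region, existing: list) -> bool:
    # A shared region must be shared between exactly two distinct entities.
    if len(set(new.owners)) != 2:
        return False
    # It must not overlap any private or shared region already registered.
    return not any(overlaps(new, r) for r in existing)
```

A region adjacent to an enclave's private memory is accepted, whereas a region that overlaps it, or one that names three owners, is rejected.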
Rogue DMA requests Malicious peripherals may try to access protected memory through rogue DMA requests. However, mechanisms to mitigate such attacks already exist in most architectures, e.g., AMD IOMMU [AMD07], Intel VT-d [AJM + 06], and ARM SMMU [Hol13]. These mechanisms process every DMA request and verify its validity according to some access policy. Any memory access attempt that does not fit the access policy is blocked (G1 and G2). Currently, there is no standardized mechanism to limit such DMA requests in RISC-V. However, there is a proposal of an input-output variant of PMP called IOPMP [Ku21]. IOPMP enforces the configured PMP rules for non-RISC-V peripherals and mitigates DMA attacks completely.

Malicious application or driver enclaves
The attacker-controlled OS can spawn malicious application enclaves and driver enclaves. However, users remotely attest before providing any secret to the application enclave. During the attestation, the user checks the attestation report of both the application and driver enclave and aborts if they do not match the intended enclave measurements. The attestation also reveals any misconfiguration of communication links by an adversary (G2 and G3). Note that this only verifies the current configuration of communication links. Upon any change to this setup, the application enclave might require the external verifier to re-attest (cf. Section 6.2).
We require the driver enclave to provide isolation between multiple connected application enclaves (cf. Section 5.6.2). Hence, an attacker-controlled application enclave cannot access the confidential data of other application enclaves in the same driver enclave.
Vulnerabilities within any of these unit enclaves could break the isolation guarantees of the data in that specific unit enclave. However, such an attack remains contained in the compromised unit enclave and cannot spread to other connected enclaves. E.g., if a vulnerability in a driver enclave is found, only the data within that enclave is revealed. Any data that does not pass through the compromised driver enclave remains confidential. In this way, we provide defense-in-depth and reduce the potential impact of vulnerabilities.

Malicious specialized hardware
If an adversary manages to compromise the exact device used by a composite enclave, then any data on the device is forfeit. However, any data not passed to the malicious device remains confidential (G2). We stress that certain manipulations of specific peripherals are always possible for an adversary. Consider, for example, a temperature sensor. Any local physical adversary can increase the real-world temperature and thus manipulate the sensor reading. However, as we describe in our attacker model in Section 3.3, the physical attacker is out-of-scope of this paper.
Remapping Attacks Many specialized hardware devices are plug-and-play and thus dynamically mapped by the OS. Therefore, the OS may also change the mapping during runtime, potentially leading to confidential data being shared with the wrong entity. We analyze all types of dynamically mapped specialized hardware and how our proposal prevents such a remapping attack (G2).
Dynamically mapped specialized hardware devices can use one out of the following mechanisms: i) a DMA region which facilitates all communication, ii) a bus controller driver facilitates the communication, or iii) a mix of both of these. Note that MMIO interfaces are generally not dynamic and do not change during runtime.
In remapping attacks against pure DMA devices, the OS may remap the DMA buffer to a different address range. There are two weak points where confidential data could leak: the unit enclave on the CPU could share confidential data with a remapped untrusted device, or the device could share results with the wrong entity on the processor. However, the OS needs to notify the device of the remapping (if this does not happen, the device will write to the wrong address), so the second potential leakage is ruled out immediately. In the other case, it is essential to note that the shared memory region of the unit enclave remains protected by PMP entries. Thus, even after remapping, the OS cannot access the shared memory region containing confidential data.
If the communication is (partially) facilitated by the bus controller, the bus controller and its accompanying driver must be part of the TCB since both of them process all communication and may leak confidential data.
Side-channel attacks While we do not evaluate any defenses against side-channel attacks, we discuss potential side-channel attacks against our proposal and how they could be mitigated. Many parts of composite enclaves remain the same as in traditional TEEs, where side-channels have been widely investigated [BCD + 19, BMD + 17, GLS + 17] (G1). However, we note that our approach creates some new side-channels that may not be present in traditional TEEs, such as bus contention (typically related to G2).
Microarchitectural side-channels in traditional TEEs leverage shared resources such as the cache [BMD + 17], branch predictor [LSG + 17], and memory translation [XCP15]. There exist several defenses against such attacks. Spatial partitioning of the cache in the form of cache coloring can fully defend against all cache-based side-channel attacks [CLD16, ZDS09, ZKGA20]. Similarly, other proposals have called for cache randomization [BCD + 19, WUG + 19]. Processor features such as transactional memory have also been shown to mitigate cache attacks with low overhead [GLS + 17]. To the best of our knowledge, all of these proposals can be applied to composite enclaves because their internal structure is similar to that of traditional TEEs. Specialized hardware devices contain shared resources such as caches and are thus as vulnerable as processors [NNQAG18, PAS + 20, RPD + 18]. However, mitigating these attacks is an orthogonal problem.
The introduction of specialized hardware into TEEs also implicates the bus as a new shared resource. An adversary could measure the throughput of her connection over the bus and observe any contention on the bus leading to less throughput. Bus contention, however, only exposes the bus access patterns. In extreme cases, the timing of bus contention could leak data, e.g., when one side of a branch performs bus accesses while the other does not [PLF21]. This behavior is very similar to previous timing and side-channel attacks, and there exist multiple mitigations, such as oblivious execution [RLT15], that can be applied in the same way to the bus side-channel.

Life Cycle Events
As described in Section 5.4, we introduce two additional life cycle events for unit enclaves. Connect is used to connect two unit enclaves over a shared buffer, whereas disconnect facilitates a disconnect. The disconnect is split into a synchronous and an asynchronous event. The asynchronous disconnect only occurs when one of the unit enclaves unexpectedly dies and results in the transfer of the sole ownership of the memory region to the remaining enclave. This enclave can then try to continue its execution. However, it will realize that the other unit enclave has died as it does not react to any activity on the shared memory region. At a later point, the untrusted OS can issue a synchronous disconnect to notify the unit enclave and free the shared memory officially. Note that the SM mandates a synchronous disconnect before another connect command. Due to this architecture, a stale shared buffer will never be made accessible to any untrusted entity until a synchronous disconnect occurs, during which the unit enclave will officially get notified. The separate handling of synchronous and asynchronous disconnect events enforces protection for any secret data during an enclave's entire life cycle (G2).
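The ordering constraints between connect, asynchronous disconnect, and synchronous disconnect can be summarized as a small state machine. The following sketch is a simplified model under our own assumptions: it tracks a single shared region, the class name `SharedRegion` and the state names are hypothetical, and the actual PMP reconfiguration is omitted.

```python
class SharedRegion:
    """Hypothetical model of one shared region tracked by the SM."""

    def __init__(self):
        self.state = "free"   # "free" -> "connected" -> ("stale" ->) "free"
        self.owners = set()

    def connect(self, a, b):
        # The SM mandates a synchronous disconnect before another connect.
        if self.state != "free":
            raise RuntimeError("synchronous disconnect required before connect")
        self.owners = {a, b}
        self.state = "connected"

    def async_disconnect(self, dead):
        # One side died unexpectedly: the survivor becomes the sole owner,
        # and the region stays protected from everyone else.
        if self.state != "connected" or dead not in self.owners:
            raise RuntimeError("no such connection")
        self.owners = self.owners - {dead}
        self.state = "stale"

    def sync_disconnect(self):
        # Notify the remaining owner(s) and only then free the region.
        if self.state == "free":
            raise RuntimeError("nothing to disconnect")
        notified = set(self.owners)
        self.owners = set()
        self.state = "free"
        return notified
```

In this model, a region left stale by an asynchronous disconnect cannot be reconnected until the surviving enclave has been notified through a synchronous disconnect.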

Attestation
As specified in G3, the attestation of a composite enclave should also cover all connected unit enclaves. In our proposal, the attestation report of a unit enclave contains identifiers of all the connected unit enclaves. The SM generates these identifiers and makes sure that no two running unit enclaves share the same identifier. Hence, a unit enclave could be assigned an identifier that belonged to a unit enclave in the past. Of course, strictly increasing identifiers implemented with monotonic counters could be used, but such a solution needs non-volatile storage on the CPU, which might be expensive. Now assume that the adversary kills a unit enclave and launches another unit enclave with a different binary (defined as code), but with the exact same identifier. That is, the attacker can kill unit enclave A and launch A′ with code(A′) ≠ code(A) and the same identifier, ID(A) = ID(A′). However, when a remote verifier attests A′, the verifier sees that the measurements mismatch, as code(A′) ≠ code(A), and rejects it.
Let us assume a more complex scenario with two pairs of unit enclaves: A, B and A′, B′, where code(A′) ≠ code(A) but code(B′) = code(B). A remote verifier attests to a unit enclave A that is connected to B and establishes a shared secret with A. Before the verifier attests to B, the attacker kills B. The attacker then spawns a new unit enclave B′ where ID(B) = ID(B′). The remote verifier will then attest to B′ and find that the code measurement looks fine. However, we stress that B′ cannot be connected to A, because then A would need to receive a synchronous disconnect and would need to be re-attested (due to the configuration of A). If the attacker also kills A, replaces it with A′ (where ID(A) = ID(A′)), and connects A′ and B′, the verifier would see that B′ has the correct measurement and is connected to the identifier of A (as ID(A) = ID(A′)). However, the verifier will want to provide its data to A using the shared secret they established in the previous attestation. This cannot succeed, as the new unit enclave A′ cannot know the secret.
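The identifier-reuse argument can be made concrete with a toy model. Everything in the sketch is an assumption made for illustration: the `SecurityMonitor` class, the smallest-free-identifier allocation policy, and the report layout are hypothetical, and the report signature that a real attestation would carry is left out.

```python
import hashlib


def measure(code: bytes) -> str:
    # Code measurement modeled as a SHA3-256 hash of the binary.
    return hashlib.sha3_256(code).hexdigest()


class SecurityMonitor:
    """Toy SM: identifiers are unique among running enclaves but may be reused."""

    def __init__(self):
        self.running = {}  # identifier -> measurement

    def launch(self, code: bytes) -> int:
        ident = 0
        while ident in self.running:  # assumed policy: smallest free identifier
            ident += 1
        self.running[ident] = measure(code)
        return ident

    def kill(self, ident: int) -> None:
        del self.running[ident]

    def report(self, ident: int, connected: list) -> dict:
        # Unsigned report: own measurement plus identifiers of connected enclaves.
        return {"id": ident,
                "measurement": self.running[ident],
                "connected": sorted(connected)}


def verify(report: dict, expected_code: bytes, expected_links: list) -> bool:
    # The verifier accepts only the intended binary with the intended links.
    return (report["measurement"] == measure(expected_code)
            and report["connected"] == sorted(expected_links))
```

Killing an enclave and launching a different binary yields the same identifier in this model, yet `verify` rejects the report because the measurement no longer matches.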

FPGA Prototype
We implemented an end-to-end prototype of a composite enclave on a softcore on an FPGA running a modified Keystone enclave framework [LKS + 20] (available online [Sch21]). Figure 5 shows one of our experimental setups consisting of an FPGA emulating the central processor connected to several Arduino boards that emulate specialized hardware.

FPGA platform
We base our hardware platform on the Ariane core [ZB19], an open-source RISC-V 64-bit core that supports a commodity OS such as Linux. It is an RV64GC 6-stage application class core that has been taped out multiple times and can operate up to 1.5 GHz. We run this core on a Digilent Genesys 2 FPGA board (x in Figure 5).
Since the core originally did not support PMP, we added PMP capability in around 160 lines of SystemVerilog. The PMP unit is formally verified with a bounded model check against a handwritten specification with yosys [Wol16]. Two of these units are inserted into the memory management unit (MMU) and are responsible for checking data accesses and instruction fetches. An additional unit is placed in the hardware page table walker to check page table accesses. Our implementation has a configurable number of PMP entries up to the maximum number of 16 mandated by the standard [WLA + 19]. Our modifications have been contributed to the Ariane project and are open source [Zar20]. Note that PMP is part of the RISC-V privilege standard and as such is already available on many other cores [AAB + 16, Low20].

Modifications to Keystone
We modified the SM to be able to connect two unit enclaves or a unit enclave and specialized hardware. Specifically, we added three new interfaces to the SM called connect, sync_disconnect, and async_disconnect. These interfaces can be used to set up shared regions between two unit enclaves or specialized hardware specified by their identifier. We also modified Keystone's attestation procedure to include a list of identifiers for all connected unit enclaves. Our modifications only amount to 390 additional or modified lines of code. The SM consists of around 2000 lines of code excluding SHA3 and ed25519 implementations that contribute around 4000 additional lines of code.
Every enclave runs on top of a trusted minimal runtime that handles syscalls and manages virtual memory. For our prototype, we added support to dynamically map shared memory regions into the virtual address space of a unit enclave. We modified 213 LoC out of 3600 LoC for Keystone's runtime.
Figure 6: An architecture overview of one compute cluster of our modified accelerator with one PMP control unit per cluster and individual PMP enforcement units per core.
Simple specialized hardware In our prototype, we emulate a number of simple specialized hardware devices (e.g., keyboards, mice, simple sensors, etc.) on the Arduino Due microcontroller prototyping board (y in Figure 5) using the Arduino HID library. The Due's GPIO pins are connected to the FPGA's PMOD pins over two pairs of 8 wires for bidirectional data. We modify the I2C protocol to communicate data between the Due and the FPGA. The physical limitations of the PMOD pins restrict the channel's frequency to 8 MHz, yielding 1 MB/s bandwidth. In the real world, the physical interfaces between the specialized hardware and the platform could be diverse, such as USB and PCI-E. As a concrete example, we implemented a keyboard with the Arduino board and wrote a simple keyboard driver that interprets the GPIO signal from the Arduino. Additionally, we use a PMOD interface-based seven-segment display unit as an output peripheral (z in Figure 5). The driver contains around 50 LoC and is incorporated into our example driver enclave. Additionally, we use the USBHost library that can emulate a number of USB peripheral devices on the Arduino. We use the Arduino cryptographic library for signing the challenge messages from the driver enclave during the local attestation. The Due uses 128-bit AES (CTR mode) for encryption, HMAC_SHA256 for message authentication, Curve25519 for key exchange, and SHA3 for the hash function. We use the DueFlashStorage library to implement the NVM flash that contains the key material for the peripheral attestation. Our prototype implementation is approximately 2.5K lines of code.

Accelerator
We conduct another case study to show how complex specialized hardware such as a GPU-scale accelerator [ZSB21] can be extended to support composite enclaves. The accelerator is a 4096-core RISC-V platform that has comparable performance to current machine learning accelerators. It is organized in clusters, each with 8 individual single-stage RISC-V cores [ZSHB21], each of which is accompanied by a double precision floating point unit capable of two double precision and four single precision flops per cycle. To hide memory latency, all clusters have access to a scratchpad memory and a large L2 data cache.
To provide multi-tenant isolation on the accelerator, we introduce a shared PMP control unit with 4 entries into every cluster. Every core then has its own PMP enforcement unit. The PMP entries can only be configured by one out of eight cores but the access policies will be enforced on all of them. The architecture of the modified compute cluster is shown in Figure 6. With this additional hardware support we were able to implement a small firmware that configures the PMP entries according to the specifications from the host and then runs a task in user mode. Upon a context switch, the scratchpad memory that was in use by the previous task is flushed and the PMP entries are reconfigured. The firmware consists of 143 lines of assembly and 73 lines of C code.

Evaluation
Performance of Inter-Enclave Communication As composite enclaves support shared memory for communication, their communication speed is the same as what the memory bus provides. This is much faster than in traditional TEEs, where enclaves communicate through the OS, requiring extra encryption steps. Hence, we do not believe a comparison between these two systems is meaningful. Concurrent work also demonstrates the performance gains of using shared memory between enclaves [YSCS20].

Context Switch Performance
Context switches are critical for any system and determine its responsiveness and a part of its performance. We performed experiments for various sizes of the shared memory region and report the resulting context switch latencies in Figure 7. We also measured the time of unit enclave creation, which is mostly dominated by copying all the unit enclave data from the untrusted OS to the protected memory region and thus is expected to be linear in terms of memory size. Our measurements highlight that the context switches are independent of the shared memory size. The absolute context switch time increases from 4730 cycles for stock Keystone to 4950 cycles for our prototype.

PMP Overhead
We measure the hardware overhead of PMP units in terms of the logic, the caches, and the total amount in NAND2 gate equivalents within the Ariane processor pipeline for 0, 8, and 16 PMP entries in Table 1. We instantiate the Ariane core [ZB19] with the default configuration: including the floating point unit, 32KiB L1 data cache, 16KiB L1 instruction cache, branch history table of size 64, and a 16-entry branch target buffer. We synthesized this core configuration in a 22nm technology at 1GHz.

IO Peripherals
The communication overhead between the platform and the peripheral device emulated by the Arduino Due is very small. At the time of initialization, the peripheral and the platform exchange handshake messages to perform local attestation. The initial handshake message is 60 bytes. Every message of our modified I2C protocol is 32 bytes. The combined latency introduced by signing averages around 60 µs.
Accelerator Our modifications slow the accelerator cores down from 750 MHz to 666 MHz due to the impact of the PMP access checks on the critical path. Note that this may not reflect the general case. The change in area of a single core complex (core, FPU, and an integer subsystem) can be found in Table 2 for 750 and 666 MHz respectively. Note that the size of the core increases due to the increased pressure by the PMP, while the FPU and the IPU get smaller with the lower clock as their critical path is not affected by the PMP entries. In total, the area of the entire accelerator decreased by around 0.7% while the clock frequency was reduced by 15%.

Limitations
Remote Attacker Model As mentioned in Section 3.3, we only consider a remote adversary throughout this paper. For some use cases, it is impossible to consider a local physical adversary who could, for example, change the environment that is measured via a sensor. However, this fundamental limit does not apply to more sophisticated specialized hardware such as accelerators. In this case, our proposal could be extended by two hardware modifications to cope with a physical adversary: First, the CPU needs to support memory encryption and integrity, a typical mechanism that many TEEs already employ [Int13, KPW16, SCG + 03]. Second, the communication channel between specialized hardware and the CPU, i.e., the bus, must provide confidentiality and integrity. Existing proposals from industry and academia [Gue16, KPW16, SCG + 03] indicate that such encryption capabilities are feasible and might become available in the near future.

Limited Number of PMP Entries
The number of PMP entries in the RISC-V privilege specification is limited to 16 (an extension to 64 is in discussion). This limits the number of unit enclaves and shared memory regions that may coexist on a system. Assuming one shared memory region per unit enclave, at most (N − 2)/2 unit enclaves can exist at a time (16 entries support 7 unit enclaves). However, isolation of unit enclaves could also be achieved using the memory management unit (MMU) in a similar fashion as Intel SGX [CD16]. MMU-based isolation can also easily be extended to shared memory ranges and remove any limitation on the maximum number of unit enclaves.
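The bound above is simple arithmetic and can be checked with a short helper. The function name is hypothetical, and the sketch only encodes the (N − 2)/2 formula from the text under its stated assumption of one shared memory region per unit enclave; it does not model which entries are set aside.

```python
def max_unit_enclaves(pmp_entries: int) -> int:
    """Upper bound (N - 2) / 2 on coexisting unit enclaves for N PMP entries.

    Assumes, as in the text, that each unit enclave needs one entry for its
    private memory and one for a shared region, with two entries set aside.
    """
    if pmp_entries < 2:
        return 0
    return (pmp_entries - 2) // 2


print(max_unit_enclaves(16))  # prints 7
print(max_unit_enclaves(64))  # prints 31
```

Under the same assumption, the 64-entry extension under discussion would raise the bound from 7 to 31 unit enclaves.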

Large Drivers
Specialized hardware devices can be very complex and require large drivers to work. As an example, an open-source driver for AMD GPUs in the Linux kernel occupies around 3.3 million LoC [T + 21] (most of which are generated header files; 500k LoC without headers). Moreover, such drivers also leverage other capabilities of the kernel, and moving such a driver into a single driver enclave would require replicating these capabilities. However, such a driver (e.g., for a GPU) was not created for a minimal TCB but for feature completeness. It could be possible to strip such drivers to the bare minimum needed to support the actual application enclave.
[VVB18]. Such multi-tenant GPU TEEs would fit very well within a composite enclave, as they are an excellent example of an enclave on specialized hardware and show that even some of the most powerful accelerators can be extended with a local TEE. Visor [PAS + 20] goes even further and proposes a hybrid TEE that spans both CPU and GPU and their communication. Visor is aimed towards privacy-preserving video analytics where the computation pipeline is shared between the CPU (non-CNN workloads) and the GPU (CNN workloads) to increase efficiency. HETEE [ZHW + 20] is another proposal to extend TEEs to GPUs without requiring changes to existing hardware. HETEE focuses on datacenter applications and proposes an extra hardware box per rack that is protected from physical attacks and contains all GPUs. Each enclave then runs on a dedicated compute server and a connected accelerator. In essence, the HETEE box provides secure routing of accelerators to dedicated compute servers. In contrast to HETEE, we aim to be able to execute multiple composite enclaves on the same machine.

Related Work
ARM TrustZone is a system TEE provided by ARM for their system-on-chips [Win08]. TrustZone applications run on top of a secure OS that is trusted and isolated from the standard OS (rich OS). TrustZone only provides the lower level isolation property between the rich OS and the secure OS with an extra bit on the bus. Everything else, i.e., isolation between TrustZone applications or remote attestation, has to be implemented by the secure OS [Nin14]. Due to this limitation, manufacturers usually only allow TrustZone applications that are signed by them. Sanctuary [BGJ + 19] extends TrustZone with userspace enclaves. Sanctuary achieves isolation by running enclaves in their own address space in the normal world. However, it does not extend to external specialized hardware. Some other proposals [YAA + 18, LMH + 14, LSDB18, LLS + 18] enable additional security properties such as a trusted path by enabling direct pairing of peripherals (e.g., the touchscreen) to TrustZone applications. However, these are only geared towards IO for trusted path and do not support generic (dynamic) devices.
Finally, CURE [BBD + 21] proposes a TEE architecture that enables enclaves on all privilege levels. As such, CURE also enables enclaves that have exclusive access to specific peripherals against a software adversary, similar to our approach. However, attestation to an enclave in CURE does not extend to peripherals. Besides, kernel-space enclaves in CURE run on a reserved core with, to the best of our knowledge, no option to yield back to the OS, thus wasting resources while waiting for new data from peripherals.

Other Isolation Methods
Minimal hypervisors or microkernels [HBG + 06] can also achieve isolation, and some are even formally verified [KEH + 09, VCJ + 13]. Usually, such proposals do not natively support attestation. However, by adding a root-of-trust and some minor software components to measure and sign applications, microkernels could be extended to support a simple form of remote attestation similar to academic TEE proposals [LKS + 20]. While there are other challenges to overcome such as key distribution and revocation, and software updates, these challenges are identical for TEEs and have been handled in the past [Int13,KPW16]. From a TCB perspective, hypervisors and microkernels include a scheduler, moving it from the untrusted OS to the TCB, and thus, may result in a bigger TCB.
Bump in the Wire-based Solutions Fidelius [ECB + 19], ProtectIOn [DUKC20], IntegriScreen [SUD + 20], FPGA-based overlays [BT17], and IntegriKey [DYKC17] are some of the trusted path solutions that use external trusted hardware devices as intermediaries between the platform and IO devices. These external devices create a trusted path between a remote user and the peripheral and enable the user to exchange sensitive data securely with the peripheral in the presence of an attacker-controlled OS.

Related Standards
Recently, there have been multiple upcoming standards backed by major players from the industry focused on new bus architectures [Con17,Con20]. These proposals are motivated by the move to more specialized hardware and to disaggregated computing. CCIX [Con17] tries to extend PCIe with a cache coherency protocol to allow multiple chips to have the same view of memory. All chips connected with CCIX may have their own memory, cache, and compute. However, all chips interconnected with CCIX are equally privileged, leading to a rather bleak security outlook for CCIX.
The other upcoming standard, CXL [Con20], assumes a platform architecture similar to today's, with a host processor connected to multiple accelerators. As such, CXL is able to simplify the protocol by following a master-slave principle. CXL allows accelerators to cache shared memory. As such, the interaction between the CPU and accelerators no longer needs expensive copying operations, and both may even operate on the same data at the same time. CXL also has some provisions for link encryption, leveraging authenticated encryption to defend against bus tapping attacks. However, CXL is only a bus architecture and does not consider an adversary in either the accelerator or the host. Nevertheless, CXL would be a prime candidate to integrate into composite enclaves.