The rapid rise in social connectivity and e-commerce requires increasing computer memory to store and process larger amounts of data with low latency.
But more computer memory means more memory errors, which can disrupt service availability and cause data loss.
In this context, SHEPHERD investigates cross-layer memory-resilience techniques in which software and hardware cooperate to prevent, at low overhead, the service unavailability and data loss that memory errors cause in disaggregated memory.
Publications
2024
-
Taming Performance Variability caused by Client-Side Hardware Configuration
Georgia Antoniou, Haris Volos, and Yiannakis Sazeides
In IISWC ’24: Proceedings of the 2024 IEEE International Symposium on Workload Characterization 2024
-
Agile C-states: A Core C-state Architecture for Latency Critical Applications Optimizing both Transition and Cold-Start Latency
Georgia Antoniou, Davide B. Bartolini, Haris Volos, Marios Kleanthous, Zhe Wang, Kleovoulos Kalaitzidis, Tom Rollet, Ziwei Li, Onur Mutlu, Yiannakis Sazeides, and Jawad Haj-Yahya
ACM Transactions on Computer Architecture and Code Optimization 2024
Latency-critical applications running in modern datacenters exhibit irregular request arrival patterns and are implemented using multiple services with strict latency requirements (30us–250us). These characteristics render existing energy-saving idle CPU sleep states ineffective due to the performance overhead caused by the states’ transition latency. Besides the state transition latency, another important contributor to the performance overhead of sleep states is the cold-start latency, that is, the time required to warm up microarchitectural state (e.g., cache contents, branch predictor metadata) that is flushed or discarded when transitioning to a lower-power state. Both the transition latency and the cold-start latency can be particularly detrimental to the performance of latency-critical applications with short execution times.
While prior work focuses on mitigating the effects of transition and cold-start latency by optimizing request scheduling, in this work, we propose a redesign of the Core C-state architecture for latency-critical applications. In particular, we introduce C6Awarm, a new Agile Core C-state that drastically reduces the performance overhead caused by idle sleep state transition latency and cold-start latency, while maintaining significant energy savings. C6Awarm achieves its goals by 1) implementing medium-grained power gating, 2) preserving the microarchitectural state of the core, and 3) keeping the clock generator and PLL active and locked. Our analysis for a set of microservices on an Intel Skylake-based server shows that C6Awarm manages to reduce the energy consumption by up to 70% with limited performance degradation (at most 2%).
2021
-
The Case for Replication-Aware Memory-Error Protection in Disaggregated Memory
Haris Volos
IEEE Computer Architecture Letters 2021
Disaggregated memory leverages recent technology advances in high-density, byte-addressable non-volatile memory and high-performance interconnects to provide a large memory pool shared across multiple compute nodes. Due to higher memory density, memory errors may become more frequent. Unfortunately, tolerating memory errors through existing memory-error protection techniques becomes impractical due to increasing storage cost. This letter proposes replication-aware memory-error protection to improve storage efficiency of protection in data-centric applications that already rely on memory replication for performance and availability. It lets such applications lower protection storage cost by weakening the protection of each individual replica, but still realize a strong protection target by relying on the collective protection conferred by multiple replicas.
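The idea of collective protection can be illustrated with a small back-of-the-envelope calculation. The sketch below is not the model used in the letter: it assumes that unrecoverable errors strike replicas independently, and the function names and the example numbers are hypothetical, chosen only for illustration. Under that assumption, data is lost only if every replica is unrecoverable, so each replica can tolerate a much higher failure probability than the overall target.

```python
def collective_failure_probability(per_replica_failure: float, replicas: int) -> float:
    """Probability that all replicas are unrecoverable at once.

    Assumption (not from the letter): failures are independent across replicas.
    """
    if not 0.0 <= per_replica_failure <= 1.0:
        raise ValueError("per_replica_failure must be in [0, 1]")
    if replicas < 1:
        raise ValueError("replicas must be at least 1")
    return per_replica_failure ** replicas


def weakest_per_replica_target(collective_target: float, replicas: int) -> float:
    """Largest per-replica failure probability that still meets the collective target.

    Inverts collective_failure_probability under the same independence assumption.
    """
    if not 0.0 < collective_target <= 1.0:
        raise ValueError("collective_target must be in (0, 1]")
    if replicas < 1:
        raise ValueError("replicas must be at least 1")
    return collective_target ** (1.0 / replicas)


if __name__ == "__main__":
    # Hypothetical numbers: a collective target of 1e-12 spread over 3 replicas.
    target = 1e-12
    for n in (1, 2, 3):
        print(n, weakest_per_replica_target(target, n))
```

With three replicas and an illustrative collective target of 1e-12, each replica only needs to meet roughly 1e-4 on its own, which is the sense in which weaker (and cheaper) per-replica protection can still add up to a strong overall target. Correlated failures would weaken this bound.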
Funding info
Grant agreement ID: 101029391
Start date: 1 September 2021
End date: 30 December 2024
Funded under: H2020-EU.1.3.2.
Coordinated by: University of Cyprus