AgilePkgC: An Agile System Idle State Architecture for Energy Proportional Datacenter Servers
Georgia Antoniou, Haris Volos, Davide B. Bartolini, Tom Rollet, Yiannakis Sazeides, and Jawad Haj-Yahya
Modern user-facing applications deployed in datacenters use a distributed system architecture that exacerbates the latency requirements of their constituent microservices (30-250 μs). Existing CPU power-saving techniques degrade the performance of these applications due to the long transition latency (on the order of 100 μs) to wake up from a deep CPU idle state (C-state). For this reason, server vendors recommend only enabling shallow core C-states (e.g., CC1) for idle CPU cores, thus preventing the system from entering deep package C-states (e.g., PC6) when all CPU cores are idle. This choice, however, impairs server energy proportionality since power-hungry resources (e.g., IOs, uncore, DRAM) remain active even when there is no active core to use them. As we show, it is common for all cores to be idle due to the low average utilization (e.g., 5-20%) of datacenter servers running user-facing applications. We propose to reap this opportunity with AgilePkgC (APC), a new package C-state architecture that improves the energy proportionality of server processors running latency-critical applications. APC implements PC1A (package C1 agile), a new deep package C-state that a system can enter once all cores are in a shallow C-state (i.e., CC1) and that has a nanosecond-scale transition latency. PC1A is based on four key techniques. First, a hardware-based agile power management unit (APMU) rapidly detects when all cores enter a shallow core C-state (CC1) and triggers the system-level power savings control flow. Second, an IO Standby Mode (IOSM) places IO interfaces (e.g., PCIe, DMI, UPI, DRAM) in shallow (nanosecond-scale transition latency) low-power modes. Third, a CLM Retention (CLMR) mode rapidly reduces the voltage of the CLM (Cache-and-home-agent, Last-level-cache, and Mesh network-on-chip) domain to its retention level, drastically reducing its power consumption. Fourth, APC keeps all system PLLs active in PC1A to allow nanosecond-scale exit latency by avoiding PLL re-locking overhead. Combining these techniques enables significant power savings while requiring less than 200 ns transition latency, >250× faster than existing deep package C-states (e.g., PC6), making PC1A practical for datacenter servers. Our evaluation based on an Intel Skylake-based server shows that APC reduces the energy consumption of Memcached by up to 41% (25% on average) with <0.1% performance degradation. APC provides similar benefits for other representative workloads.
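The entry and exit flow described above can be illustrated with a small software model. This is only a sketch under stated assumptions: the real APMU is a hardware unit, and the class names, fields, and the ordering of the steps below are hypothetical choices made for illustration, not details taken from the paper. The model also ignores wake-ups caused by IO traffic and considers only core activity.

```python
from dataclasses import dataclass
from enum import Enum


class CoreState(Enum):
    CC0 = "active"
    CC1 = "shallow idle"


class PkgState(Enum):
    PC0 = "package active"
    PC1A = "package C1 agile"


@dataclass
class Package:
    """Toy model of a server package; every field is illustrative."""
    cores: list
    pkg_state: PkgState = PkgState.PC0
    io_standby: bool = False      # IOSM engaged (hypothetical flag)
    clm_retention: bool = False   # CLMR engaged (hypothetical flag)
    plls_locked: bool = True      # PLLs are never turned off in this model

    def all_cores_idle(self) -> bool:
        return all(core is CoreState.CC1 for core in self.cores)

    def apmu_step(self) -> PkgState:
        """One evaluation of the assumed APMU decision logic."""
        idle = self.all_cores_idle()
        if idle and self.pkg_state is PkgState.PC0:
            # Entry: IO interfaces go to standby and the CLM domain
            # drops to retention voltage; PLLs stay locked.
            self.io_standby = True
            self.clm_retention = True
            self.pkg_state = PkgState.PC1A
        elif not idle and self.pkg_state is PkgState.PC1A:
            # Exit: undo both power-saving modes. No PLL re-lock is
            # needed because the PLLs were kept active.
            self.clm_retention = False
            self.io_standby = False
            self.pkg_state = PkgState.PC0
        return self.pkg_state


if __name__ == "__main__":
    pkg = Package(cores=[CoreState.CC0, CoreState.CC1])
    print(pkg.apmu_step().name)   # one core still active
    pkg.cores[0] = CoreState.CC1
    print(pkg.apmu_step().name)   # all cores in CC1
    pkg.cores[1] = CoreState.CC0
    print(pkg.apmu_step().name)   # a core woke up
```

The point of the sketch is that the package state depends only on whether every core is in CC1, and that `plls_locked` never changes, which mirrors the stated reason PC1A can be exited without PLL re-locking overhead.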