AgileWatts and AgilePkgC at MICRO 2022
This week members from our team and collaborators from Huawei Zurich attended the 55th IEEE/ACM International Symposium on Microarchitecture, where they presented our work on making datacenter servers more energy efficient. MICRO is the premier forum for presenting, discussing, and debating innovative microarchitecture ideas and techniques for advanced computing and communication systems. We were fortunate to have two of our research papers accepted and presented at the conference.
The first talk, delivered by Jawad Haj Yahya, was about AgileWatts, an energy-efficient CPU core idle-state architecture for latency-sensitive server applications. User-facing applications running in modern datacenters exhibit irregular request patterns and are implemented using a multitude of services with tight latency requirements. These characteristics render existing energy-conserving techniques ineffective when processors are idle, because transitioning out of a deep idle power state (C-state) takes too long. While prior work proposes management techniques to mitigate this inefficiency, we tackle it at its root with AgileWatts: a new deep C-state architecture optimized for datacenter server processors targeting latency-sensitive applications. AgileWatts is based on three key ideas. First, AgileWatts eliminates the latency overhead of saving and restoring the core context (i.e., micro-architectural state) when powering the core off and on in a deep idle power state by i) implementing medium-grained power-gates, carefully distributed across the CPU core, and ii) retaining context in the power-ungated domain. Second, AgileWatts eliminates the flush latency overhead (several tens of microseconds) of the L1/L2 caches when entering a deep idle power state by keeping L1/L2 cache content power-ungated. Minimal control logic also remains power-ungated to serve cache coherence traffic (i.e., snoops) seamlessly. AgileWatts implements a sleep mode in the caches to reduce cache leakage power and lowers the core voltage to the minimum operational level to minimize the leakage power of the power-ungated domain. Third, using a state-of-the-art power-efficient all-digital phase-locked loop (ADPLL) clock generator, AgileWatts keeps the PLL active and locked during the idle state, further cutting precious microseconds of wake-up latency at a negligible power cost.
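To make the three ideas concrete, here is a minimal sketch of how they map onto a latency budget. It is only an illustration: the class, the function, the `residual_us` term, and every number in it are hypothetical placeholders, not measurements from the paper.

```python
from dataclasses import dataclass


@dataclass
class IdleStateLatency:
    """Latency components (in microseconds) of a deep core C-state round trip.

    The decomposition is an assumption made for illustration only.
    """
    context_save_restore_us: float  # saving/restoring micro-architectural state
    cache_flush_us: float           # flushing L1/L2 on entry
    pll_relock_us: float            # re-locking the PLL on exit
    residual_us: float              # hypothetical catch-all for everything else

    def total_us(self) -> float:
        return (self.context_save_restore_us + self.cache_flush_us
                + self.pll_relock_us + self.residual_us)


def apply_agilewatts(baseline: IdleStateLatency) -> IdleStateLatency:
    """Zero out the three components that the three key ideas target."""
    return IdleStateLatency(
        context_save_restore_us=0.0,  # idea 1: context retained in ungated domain
        cache_flush_us=0.0,           # idea 2: L1/L2 content kept power-ungated
        pll_relock_us=0.0,            # idea 3: PLL stays active and locked
        residual_us=baseline.residual_us,
    )


# Placeholder values chosen only to exercise the model; NOT measured data.
baseline = IdleStateLatency(context_save_restore_us=10.0, cache_flush_us=30.0,
                            pll_relock_us=5.0, residual_us=1.0)
agile = apply_agilewatts(baseline)
print(f"baseline: {baseline.total_us()} us, with AgileWatts ideas: {agile.total_us()} us")
```

The point of the sketch is structural: each key idea removes one term from the sum, so whatever remains (the hypothetical residual here) bounds the wake-up latency.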
The second talk, delivered by Georgia Antoniou, was about AgilePkgC, an agile system idle-state architecture for energy-proportional datacenter servers. Modern user-facing applications deployed in datacenters use a distributed system architecture that exacerbates the latency requirements of their constituent microservices. Existing CPU power-saving techniques degrade the performance of these applications due to the long transition latency (on the order of 100 µs) to wake up from a deep CPU idle state (C-state). For this reason, server vendors recommend enabling only shallow core C-states (e.g., CC1) for idle CPU cores, which prevents the system from entering deep package C-states (e.g., PC6) when all CPU cores are idle. This choice, however, impairs server energy proportionality, since power-hungry resources (e.g., IOs, uncore, DRAM) remain active even when there is no active core to use them. As we show, it is common for all cores to be idle due to the low average utilization (e.g., 5-20%) of datacenter servers running user-facing applications. We propose to reap this opportunity with AgilePkgC, a new package C-state architecture that improves the energy proportionality of server processors running latency-critical applications. AgilePkgC implements PC1A (package C1 agile), a new deep package C-state that a system can enter once all cores are in a shallow C-state (i.e., CC1) and that has a nanosecond-scale transition latency. PC1A is based on four key techniques. First, a hardware-based agile power management unit (APMU) rapidly detects when all cores enter a shallow core C-state (CC1) and triggers the system-level power savings control flow. Second, an IO Standby Mode (IOSM) places IO interfaces (e.g., PCIe, DMI, UPI, DRAM) in shallow low-power modes with nanosecond-scale transition latency. Third, a CLM Retention (CLMR) mode rapidly reduces the CLM (Cache-and-home-agent, Last-level-cache, and Mesh network-on-chip) domain’s voltage to its retention level, drastically reducing its power consumption. Fourth, AgilePkgC keeps all system PLLs active in PC1A to allow nanosecond-scale exit latency by avoiding PLL re-locking overhead. Combining these techniques enables significant power savings while requiring less than 200 ns of transition latency, 250x faster than existing deep package C-states (e.g., PC6), making PC1A practical for datacenter servers.
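The entry condition and the closing arithmetic can be sketched in a few lines. This is a toy software model of a decision the paper makes in hardware, under stated assumptions: the function names are hypothetical, the string "PC0" stands in for "package active" and does not appear in the text above, and the PC6 figure is derived only from the "200 ns" and "250x" numbers quoted here.

```python
from typing import Iterable, List

PC1A_TRANSITION_NS = 200  # upper bound quoted in the text
SPEEDUP_VS_PC6 = 250      # "250x faster" quoted in the text


def implied_pc6_latency_us() -> float:
    """PC6 transition latency implied by the two quoted figures, in microseconds."""
    return PC1A_TRANSITION_NS * SPEEDUP_VS_PC6 / 1000.0


def package_state(core_states: Iterable[str]) -> str:
    """Toy model of the APMU decision: enter PC1A only when every core is in CC1.

    "PC0" is an assumed label for the active package state.
    """
    states = list(core_states)
    if states and all(state == "CC1" for state in states):
        return "PC1A"
    return "PC0"


def pc1a_entry_actions() -> List[str]:
    """The three power-related techniques applied on entry, as described in the text."""
    return [
        "IOSM: put IO interfaces in shallow low-power modes",
        "CLMR: drop the CLM domain voltage to its retention level",
        "PLLs: keep all system PLLs active to avoid re-locking on exit",
    ]


print(package_state(["CC1", "CC1", "CC1", "CC1"]))  # all cores idle
print(package_state(["CC1", "CC0", "CC1", "CC1"]))  # one core still active
print(f"implied PC6 transition latency: {implied_pc6_latency_us()} us")
```

Working the numbers through, 200 ns times 250 gives 50 µs, so the comparison places existing deep package C-states in the tens-of-microseconds range, which is the gap PC1A is designed to close.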