#WhyHPCCooling

Exascale: HPC Cooling
Taking an inside look at the changing face of HPC cooling as we move towards exascale.
Cray Inc.
Q2: How do you decide between air and water cooling?
HPC_Badger
(1/3) @cray_inc There are considerations for performance and considerations for the datacenter.
Isabel Valoria Rao, P.Eng.
What are the main decision criteria?
HPC_Badger
(2/3) For performance, it’s about matching the cooling to the technologies so one is able to get the most efficient use out of them. For many of today’s modern processors, that may lead us to direct liquid cooling.
HPC_Badger
(3/3) For data center considerations, it's often about reducing the costs of the cooling solution. Some have access to "free" air cooling, and others have access to water that can reduce their TCO.
Wade Doll
(1/2) There is a cross-over point where water cooling just makes more sense. This is driven by CPU requirements, desires to minimize cooling costs, and density.
Wade Doll
(2/2) Air cooling, with its lower CAPEX, sits at one end of the spectrum; water cooling, with its lower TCO, sits at the other.
HPC_Badger
I think that the decision often comes down to the best TCO while meeting performance goals. That TCO calculation should include the costs of installing in a data center and any changes required, the acquisition costs, and the running costs over the system's lifetime.
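A minimal sketch of the TCO framing described above. Every cost figure here is a hypothetical placeholder (none come from Cray or the panelists); the point is only that one-time facility and acquisition costs get added to running costs over the system's lifetime, which is how air cooling's lower CAPEX can still lose to liquid cooling's lower lifetime cost.

```python
# Hypothetical TCO sketch; all dollar figures are illustrative assumptions.

def lifetime_tco(facility_changes, acquisition, annual_running_cost, years):
    """TCO = one-time facility changes + acquisition + running costs over the lifetime."""
    return facility_changes + acquisition + annual_running_cost * years

# Example: compare an air-cooled and a direct-liquid-cooled option over 5 years.
air    = lifetime_tco(facility_changes=0.2e6, acquisition=1.0e6, annual_running_cost=0.45e6, years=5)
liquid = lifetime_tco(facility_changes=0.5e6, acquisition=1.3e6, annual_running_cost=0.25e6, years=5)

print(f"air-cooled TCO:    ${air/1e6:.2f}M")
print(f"liquid-cooled TCO: ${liquid/1e6:.2f}M")
```

With these assumed numbers, the liquid option comes out cheaper over five years despite the higher up-front spend, which is the trade-off Wade describes above.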
CoolIT Systems
@HPC_Badger @Cray_inc Agreed. Direct Liquid Cooling delivers on all three of the key demands that are driving the data center cooling industry today: increased rack density, optimized performance and maximum energy efficiency.
Brandon Peterson
From CoolIT's perspective, we see our customers move to liquid cooling primarily to enable high-TDP processors. Thermally, this typically gets them to 60-70% heat capture into liquid. (1/2)
Einar Næss Jensen
How much difference is there, regarding TCO, between rear-door cooling (with water) and direct node/CPU cooling?
Brandon Peterson
From there, our customers look at lower density heat sources to find the right balance between % heat capture and TCO, targeting CPU VR, memory and other heat sources to push heat capture to 85%+. Some even target 100% heat capture to liquid. (2/2)
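A rough illustration of the heat-capture fractions Brandon mentions, using an assumed rack power (the 40 kW figure is an arbitrary example, not a vendor number): whatever heat is not captured into liquid still has to be removed by the data center's air-side cooling.

```python
# Illustrative only: split a rack's heat load between liquid and air
# for a given heat-capture fraction. The rack power is an assumption.

def heat_split(rack_power_kw, capture_fraction):
    """Return (heat into liquid, heat left for air handling) in kW."""
    to_liquid = rack_power_kw * capture_fraction
    return to_liquid, rack_power_kw - to_liquid

rack_kw = 40.0  # hypothetical rack power
for capture in (0.65, 0.85, 1.00):  # CPU-only, +VRs/memory, full capture
    liquid, air = heat_split(rack_kw, capture)
    print(f"{capture:.0%} capture: {liquid:.1f} kW to liquid, {air:.1f} kW still cooled by air")
```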
Curt Wallace
Air was never an efficient heat extractor, just an abundant one. When the cooling needs of a new system approach the cooling capacity of the DC, a different solution is needed. Retrofitting a DC is better than building a new one.
Wade Doll
@einjen RDHX gets you part of the way there. But still having fans in the server uses more power than direct liquid cooling with cold plates.
Brandon Peterson
@einjen We find the decision for rear door vs. direct liquid cooling vs. combined rear door/direct liquid comes down to the existing infrastructure and available liquid temperatures. Each site has a different TCO model, determined by Capex and Opex requirements for that site.
Cray Inc.
Q1: What’s the biggest challenge we face in cooling systems as we approach #exascale?
Wade Doll
The infrastructure and cost of the cooling systems scale with the high power demands of an Exascale system and are a significant burden on the full-system TCO.
HPC_Badger
(1/2) The processor performance curve also plateaus dramatically with power and temperature, so keeping the processors in the most optimized region is particularly challenging and even more important at Exascale. Keeping all of them in the same part of that curve to provide
HPC_Badger
(2/2) consistent performance across the application is really challenging.
Wade Doll
I would also add that, because of data center power constraints, the power demands of the cooling system take away from the available computational power.
Brandon Peterson
@DollWade This and your other point above are interesting in regard to % heat capture into liquid for a liquid-cooled system. Has Cray done any analysis on the balance between heat capture and reduction in cooling system power consumption vs. overall cost/TCO?
Wade Doll
Yes, Brandon. Basically, the more heat you can get into the liquid closest to the heat source, the lower the TCO.
Cray Inc.
Q6: How can effective cooling impact my overall TCO?
Wade Doll
By facilitating datacenters that want to get rid of chillers and use much more economical cooling towers. The result is the need to support much higher-temperature coolants.
HPC_Badger
(1/2) More effective and efficient cooling reduces wasted power, and therefore costs. Keeping modern processors cool also yields higher performance, leading to better TCO since there is more profit and lower cost.
HPC_Badger
(2/2) If one can do this while minimizing or obviating facility changes, it is a real winner.
Jason Zeiler
By replacing the fans in CRACs, CRAHs, rear doors, and the servers themselves with pumps in the CDU, the electricity used is much lower.
Curt Wallace
(1/2) Imagine being able to efficiently and effectively cool up to 100 kW in a rack with 35C warm water (or 200 kW with 13C chilled water). Imagine servers using ~10% less power because they don't need fans. (Hyperscalers have measured up to 22% less power draw.)
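A back-of-the-envelope reading of Curt's figures: the 100 kW rack and the 10-22% savings come from his tweet, while the annualization simply assumes continuous operation, so the yearly numbers are illustrative.

```python
# Rough sketch: fan-related power savings for a warm-water-cooled rack.
# Rack power and savings fractions are quoted above; everything else is assumed.

rack_power_kw = 100.0        # warm-water-cooled rack from the example above
fan_savings = (0.10, 0.22)   # ~10% typical, up to 22% measured by hyperscalers

for frac in fan_savings:
    saved_kw = rack_power_kw * frac
    saved_mwh_per_year = saved_kw * 8760 / 1000  # assumes 24/7 operation
    print(f"{frac:.0%} less server power -> {saved_kw:.0f} kW saved, "
          f"~{saved_mwh_per_year:.0f} MWh/year per rack")
```

Under these assumptions, the gap between the two savings fractions is roughly 100 MWh of electricity per rack per year.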
Brandon Peterson
For liquid cooled systems: Capex - reduced need to invest in equipment like chillers, large air handling systems, CRACs, CRAHs etc. Opex - reduced need to operate equipment that is still required, including server fans and remaining data center cooling equipment
Wade Doll
@CurtW_HPC Curt, this is in total alignment with Cray's thinking. But I would stretch the goal to 45C warm water to cover EU markets
Curt Wallace
(2/2) No chilled air. No hot spots. While direct-to-chip cooling is a great first step, it doesn't have quite the capacity of full immersion cooling. Immersion definitely has a few caveats, but it has mPUE <1.05.
Brandon Peterson
@DollWade In regard to EU markets, it is also more common there (than North America) to see data center heat recapture. This can also greatly benefit TCO by using data center heat to reduce the utility costs for nearby buildings.
Brandon Peterson
@CurtW_HPC CoolIT sees the liquid cooling market (both Direct and Immersion) growing significantly over the next few years. However, we focus on Direct due to the ease of integrating into existing DCs and testing that shows performance advantages enabling higher TDP processors
Cray Inc.
Q3: Does Cray’s approach to system cooling change with larger systems?
Einar Næss Jensen
What defines a "larger system"?
HPC_Badger
That's a bit subjective, and it differs by metric. By performance, I'd say about 1PF peak for some of this discussion.
HPC_Badger
(1/2) The approach doesn’t change, as the physics and the challenges are the same, but the optimization and options may differ based on scale.
HPC_Badger
(2/2) At smaller scales, perhaps it’s more about fitting an existing environment with minor changes to the data center yet looking for the best performance available.
Jason Zeiler
To add to the question, when do rack-based CDUs make more sense than row-based CDUs? Rack-based CDUs offer better granular control and redundancy at the rack, but at a higher cost.
Wade Doll
(1/2) To a degree. Small systems, say less than 256 nodes, produce a level of heat that can easily be handled by many data centers and by different methods.
Wade Doll
(2/2) When you get to thousands of nodes, it makes sense to transition to liquid cooling for the performance, cost, and density advantages
Einar Næss Jensen
So, if you're aiming for 3-4 PF, and the infrastructure can deliver temperatures between 12-25 Celsius and even as high as ca. 40: direct cooling?
Wade Doll
Jason, totally agree about the better granularity of rack-based. But I think when you get to large exascale systems (a lot of racks) there are some cost advantages to row-based.
HPC_Badger
@einjen (1/2) Good question, and it somewhat depends on the technologies involved. With some processor types, they can perform at their best even when cooled with 40C water. Others may not and might require 32C.
Brandon Peterson
@einjen Assuming you are referring to liquid temperatures up to 40C, this would likely favor direct liquid cooling. Rear doors typically require liquid temperatures in the range of 16-18C. Direct liquid cooling (depending on the system design) can handle up to ASHRAE W4, 45C.
HPC_Badger
@einjen (2/2) It is best to align the technology to the workloads while evaluating the cooling options.
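A rough sketch mapping Einar's facility water temperatures to the options named in this thread, using only the figures quoted above (rear doors wanting roughly 16-18C supply, direct liquid cooling handling up to ASHRAE W4 at 45C). The thresholds and the decision logic are illustrative, not a formal ASHRAE reference or a vendor selection tool.

```python
# Illustrative mapping from supply water temperature to the cooling options
# discussed above; thresholds are the figures quoted in this thread.

def cooling_options(supply_temp_c):
    options = []
    if supply_temp_c <= 18:
        options.append("rear-door heat exchanger")
    if supply_temp_c <= 45:
        options.append("direct liquid cooling (within ASHRAE W4)")
    return options or ["check vendor specs / consider chiller assist"]

for temp in (12, 25, 40):  # the temperature range Einar mentioned
    print(f"{temp}C supply water -> {', '.join(cooling_options(temp))}")
```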
Einar Næss Jensen
What does Cray prefer? Hot water cooling or ice water?
Wade Doll
Cray would prefer ice water. But that is not in the best interest of the data center. It takes a lot more energy for the DC to create ice water than using "free" cooling tower methods.
CoolIT Systems
@einjen From @CoolITSystems' point of view, we believe warm water cooling is the more efficient way to go. Less energy is spent on chilling the water.
Jason Zeiler
@DollWade Why so? If direct liquid cooling can operate efficiently at higher temperatures, is the desire for chilled water simply to allow for more uses of the cold fluid (rear door and direct liquid cooling)?
Einar Næss Jensen
OK. At our site, we use heat exchangers to heat water up to ca. 80-90 Celsius. If I understand the process correctly, we then get cold water "for free". Would it still make sense to use warm water cooling?
Wade Doll
Jason, my comment about chilled water is just that it is an easier solution. But, not necessarily the right solution from a holistic view.
HPC_Badger
@einjen The most cost effective solution is usually to use the water at the temp it's available if it works within the specifications for the chips. Cooler water available for "free" is always a plus.
HPC_Badger
@einjen We've had some customers use cool water from neighboring lakes and some that pull cool water from underground aquifers making good use of "free" chilled water.
Einar Næss Jensen
All heated water is reused on campus at our site in Norway. Very useful given the chilly winter temps and rather cold summers.
Einar Næss Jensen
During the summer months, the campus is not buying any hot water from the energy company, thanks to the reusable hot water from our datacenter. It saves us a few bucks; I think $1M each year, if I remember correctly.
CoolIT Systems
@einjen @Cray_inc CoolIT has a number of European customers doing the same thing - reusing their waste server heat to warm nearby buildings. We highly encourage this green tech approach.
HPC_Badger
@einjen Yes, and we at Cray want to design systems that are flexible enough to make use of what's available to a data center to both save money and save power.
Steven Shafer
@einjen Excellent point, Einar! I'm not sure any of the water block companies are trying to reclaim the waste heat for other use.
CoolIT Systems
@einjen @Cray_inc This approach can also greatly benefit TCO by using data center heat to reduce the utility costs for nearby buildings.