[LIVE CHAT] Do you need an HPC-optimized OS?

What does the underlying OS really mean to you?

To Cray, the underlying OS means the entire operating environment: the kernel, OS services, daemons, and other software that provides user services. It also includes integrated interfaces to third-party components like workload managers.

0 Votes

Joseph B George (JBG)

Right - it's direct access to system resources. Being able to work with the OS, and sometimes enhance it (user mode or otherwise), lets our applications achieve better performance.

0 Votes

Joseph B George (JBG)

And that's why we're all doing this - better performance! :)

0 Votes

Sunny

I see Scott from Altair in the crowd. Do you have a comment, Scott?

0 Votes

Martijn de Vries

Once containerized workloads become more mainstream in HPC, the actual OS running on your nodes will become less relevant, since everything your jobs need to run will be in the container image.

2 Votes

Sunny

Yes, we see workload requirements becoming more diverse, and containers are an important part of supporting them.

0 Votes

Sunny

Container use is increasing for HPC workloads.

1 Vote

Scott Suchyta (HPC)

Agree with Martijn. The workload needs to be orchestrated such that jobs will run on the right nodes at the right time -- job schedulers will be critical in the stack

3 Votes

Joseph B George (JBG)

Yes, the applications are evolving quickly - our communities are starting to think through workload management and container orchestration

0 Votes

Tom Joy

@Scott_HPC do we have a scheduler that supports containers? I am not sure how it identifies the containers on a node...

0 Votes

Scott Suchyta (HPC)

@Tom, #PBSPro supports containers. The container is part of the request for a job, and PBS will deploy the predefined container image to the node(s).

1 Vote

Joseph B George (JBG)

+1 to @Scott_HPC - great that Altair is working on this

0 Votes

Sunny

Cray is working with Altair and others on the coming mash-up of container, WLM, and orchestration technologies that serve very broad workload requirements. We might throw provisioning in there too.

2 Votes

Tom Joy

@Scott_HPC does that mean containers have to be mentioned in resourcedef?

0 Votes

Scott Suchyta (HPC)

The #PBSPro 18.x release simplifies the container integration for sites. From the user's POV, requesting a container is just an environment variable: qsub ... -v CONTAINER_IMAGE=name_of_container

2 Votes

Scott Suchyta (HPC)

From the admin's POV, you can create custom resources to target specific nodes that are eligible to execute the requested container.

1 Vote
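
A minimal sketch of the submission described in this thread, written in Python only to show how the pieces fit together. The -v CONTAINER_IMAGE=... form comes from the comments above; the script name run_job.sh, the chunk request, and the custom resource container_ready are hypothetical placeholders that a site would replace with its own definitions. The code only builds and prints the command line; it does not call qsub.

```python
import shlex


def build_qsub_command(script, container_image, select=None):
    """Return the argv list for a job that requests a container image.

    `select` is an optional chunk specification; any custom resource named
    in it is a site-specific assumption, not a built-in scheduler resource.
    """
    argv = ["qsub", "-v", f"CONTAINER_IMAGE={container_image}"]
    if select:
        # Admin-side targeting: restrict the job to eligible nodes.
        argv += ["-l", f"select={select}"]
    argv.append(script)
    return argv


if __name__ == "__main__":
    # "container_ready" is a hypothetical custom resource name.
    cmd = build_qsub_command(
        "run_job.sh",
        "name_of_container",
        select="1:ncpus=4:container_ready=True",
    )
    print(shlex.join(cmd))
```

Keeping the image name in an environment variable means the user's job script stays the same whether or not a container is requested; only the submission line changes.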

Cray Inc.

Is there value in packaging things together?

1 Vote

Joseph B George (JBG)

There is great value in packaging and optimizing the whole environment, which is what Cray uniquely does. The user gets a complete, integrated distribution and does not have to build the environment from component parts.

0 Votes

Martijn de Vries

it's valuable to be able to deploy a tuned and flexible setup so that the wheel does not have to be reinvented every time a system gets deployed

4 Votes

Piush Patel

it will result in a more stable environment and make things easier to support on mission-critical compute infrastructure

1 Vote

Joseph B George (JBG)

Agree - we find that, generally, administrators spend a lot of time focusing on maintenance and keeping the machines running - the more we can keep things flexible, the more customers can focus on innovation and solving key challenges

1 Vote

Scott Suchyta (HPC)

@jbgeorge agree! admins spend a lot of time making sure all of the moving parts are working together. It really sucks when a component changes and breaks three other components that were depending on it.

1 Vote

Joseph B George (JBG)

Agree @Scott_HPC - it might be OK to do that when putting together your presents on Christmas morning (I've been there!), but you never want to do that with your HPC system!

1 Vote

Yevgeniya Perederey

@jbgeorge What are the requirements to minimize OS noise and improve system performance? #gethpcos

2 Votes

Joseph B George (JBG)

A common question! A number of system functions run through the operating system, so it can be an area of overhead, but also a place to drive efficiency - some are things you can do in the system as a whole, others are things you need to do in the kernel

1 Vote

Joseph B George (JBG)

One simple way to minimize OS noise is to examine the different types of nodes that exist in your system - some are job-running compute nodes, some are service nodes - not all nodes require the same level of OS enablement!

1 Vote

Joseph B George (JBG)

For example, one question we asked at Cray was "does a compute node need EVERYTHING in Linux to perform optimally?" The answer was no - so we've managed to drive efficiency into the compute node Linux, keeping more resources free for better application performance

1 Vote

Sunny

At the user level, users can start with the normal things they do to improve process and MPI rank synchronization at the application level. That helps.

1 Vote

Sunny

But after that, you need an operating environment that is composed to do this.

0 Votes

Alison Paisley

@jbgeorge Different but related...what does allocating jobs to specific nodes do?

1 Vote

Joseph B George (JBG)

Great question - if you think about how an HPC cluster is architected, you can have various node types throughout, differing in processor type, server generation, etc.

0 Votes

Joseph B George (JBG)

Some nodes may have a better memory profile and are better suited to memory-intensive applications. Some nodes may have newer processor types that certain applications can use to drive better performance. Being able to specify the nodes the job runs on means a better app run - which is huge.

0 Votes
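
To make the node-matching idea concrete, here is a small Python sketch of the kind of filtering and ranking involved in placing a job on suitable nodes. It is an illustration under simple assumptions, not how PBS Pro or any other workload manager is implemented; the Node and Job fields, node names, and CPU labels are all made up for the example.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Node:
    name: str
    mem_gb: int
    cpu_model: str


@dataclass(frozen=True)
class Job:
    name: str
    min_mem_gb: int
    preferred_cpu: Optional[str] = None


def eligible_nodes(job: Job, nodes: List[Node]) -> List[Node]:
    """Return nodes with enough memory, preferred CPU model first,
    then smallest sufficient memory (to keep big-memory nodes free)."""
    fits = [n for n in nodes if n.mem_gb >= job.min_mem_gb]
    return sorted(fits, key=lambda n: (n.cpu_model != job.preferred_cpu, n.mem_gb))


if __name__ == "__main__":
    cluster = [
        Node("n001", 64, "cpu-a"),
        Node("n002", 256, "cpu-a"),
        Node("n003", 512, "cpu-b"),
    ]
    job = Job("mem_heavy", min_mem_gb=200, preferred_cpu="cpu-b")
    print([n.name for n in eligible_nodes(job, cluster)])
```

A real scheduler also has to decide when the job runs, which depends on what else is queued; this sketch only covers the "where".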

Piush Patel

How does the collaboration between Cray and your partners' engineering teams (on an optimized HPC OS like CLE) benefit customers?

2 Votes

Sunny

Our engineering teams collaborate to create a seamless integration between CLE and SW like PBS Pro.

1 Vote

Sunny

The collaboration produces better scalability, performance, quality, and reliability.

0 Votes

Joseph B George (JBG)

IMHO the ecosystem is critical to see progress - there are a variety of use cases, various permutations of applications + mgmt. software + processor types + locations (cloud, on prem, etc) - collaboration between partners is critical - and the customer benefits the most

0 Votes

Joseph B George (JBG)

Your perspective, Piush?

0 Votes

Scott Suchyta (HPC)

Strong collaboration also results in shorter time to market. Partners don't have to wait for @cray_inc to deliver a feature before they start working on the integration. Very important for customers wanting bleeding-edge solutions.

1 Vote

Piush Patel

agree with scott and sunny!

1 Vote

Paul Rosien

What can be done to reduce system jitter? #GetHPCOS

2 Votes

Joseph B George (JBG)

Great question - and for those who are not familiar with the term, jitter can be seen as unpredictable variation in latency in the system. Addressing jitter can result in far better overall performance of the jobs. Re: what can be done...

1 Vote

Joseph B George (JBG)

There are some things you can do, like ensure you're using a high-speed interconnect vs something more standard. However, there are other things where you need to jump into the kernel and modify it to reduce jitter

0 Votes

Joseph B George (JBG)

At @cray_inc, we have found that tweaking internal process synchronization and memory utilization, as well as tighter execution paths in the OS, have been a great help with this

1 Vote
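
Before tuning anything, it helps to see jitter on your own node. The sketch below is a rough user-level probe in Python, loosely in the spirit of fixed-work benchmarks: it times the same small amount of work many times and reports the spread. The function names, sample count, and iteration count are arbitrary choices for illustration, and the Python interpreter adds noise of its own, so treat the numbers as indicative only.

```python
import statistics
import time


def fixed_work(iterations):
    """Do an identical amount of CPU work on every call."""
    total = 0
    for i in range(iterations):
        total += i * i
    return total


def measure_jitter(samples=200, iterations=20000):
    """Time `samples` runs of the same work and summarize the spread."""
    durations = []
    for _ in range(samples):
        start = time.perf_counter_ns()
        fixed_work(iterations)
        durations.append(time.perf_counter_ns() - start)
    fastest = max(min(durations), 1)  # guard against a zero reading
    return {
        "min_ns": min(durations),
        "median_ns": statistics.median(durations),
        "max_ns": max(durations),
        # How much slower the worst run was than the best one.
        "worst_slowdown": max(durations) / fastest,
    }


if __name__ == "__main__":
    for key, value in measure_jitter().items():
        print(f"{key}: {value}")
```

On a quiet, well-isolated compute node the worst run should stay close to the fastest one; a large gap suggests something else (daemons, interrupts, other processes) is taking time away from the application.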

Scott Suchyta (HPC)

To @HPC_sunny's comment about using the @cray_inc programming environment: what is Cray's view on using commercial and/or open source software in the HPCOS? I assume the obvious response is right tool for the job, but are there other criteria for selecting the tools?

1 Vote

Sunny

:-) Since we build on Linux, we already do. Agree with your comment. Beyond the distros...

0 Votes

Sunny

we do incorporate upstream community/open source code... and also contribute some of it back to the community

1 Vote

Joseph B George (JBG)

Criteria would include: does the application adequately support the OS, what are the admins trained on, what are existing tools integrated with, etc.

1 Vote

Cray Inc.

Can you provide suggestions for how to improve application scalability?

0 Votes

Sunny

Pay attention to your use of MPI collectives like barrier, reduce, allreduce, and bcast. Ensure you are utilizing libraries that are optimized for HPC. Profile your code using tools in Cray PE.

0 Votes

Sunny

The Cray profiler can analyze 100k MPI ranks.

0 Votes

Martijn de Vries

depends on whether your application just needs to do number-crunching (in which case optimizing MPI is the answer). here's a thought if it's not just number-crunching: develop it like a cloud-native application using 12-factor design principles

2 Votes

Joseph B George (JBG)

And from an OS perspective, two things help immensely with scalability and performance: 1) deploying jobs to the best nodes for the workload (memory, processor, etc.) and 2) using lightweight OSs on the compute nodes

0 Votes

Scott Suchyta (HPC)

@jbgeorge sounds like you need a job scheduler to figure out the when and where ;-)

1 Vote

Joseph B George (JBG)

And to Sunny's point, there is a lot you can do with a flexible programming environment

0 Votes
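
On the MPI collectives point raised in this thread, a small simulation can show why they deserve attention at scale. The Python sketch below simulates a sum-allreduce using recursive doubling, one well-known textbook algorithm; it does not use MPI, and it is not a claim about which algorithm any particular MPI library picks. The function name and the power-of-two restriction are simplifications for the example.

```python
def allreduce_recursive_doubling(values):
    """Simulate a sum-allreduce over len(values) ranks.

    Returns (per-rank results, number of exchange rounds).
    Assumes the number of ranks is a power of two.
    """
    ranks = len(values)
    if ranks == 0 or ranks & (ranks - 1):
        raise ValueError("this sketch assumes a power-of-two number of ranks")
    current = list(values)
    rounds = 0
    distance = 1
    while distance < ranks:
        # Each rank swaps its partial sum with the partner whose rank
        # number differs in exactly one bit, then adds the two together.
        current = [current[r] + current[r ^ distance] for r in range(ranks)]
        distance *= 2
        rounds += 1
    return current, rounds


if __name__ == "__main__":
    results, rounds = allreduce_recursive_doubling(list(range(8)))
    print(results, rounds)
```

Every round is a point where each rank waits on a partner, so the number of rounds grows with the logarithm of the rank count and a single delayed rank can hold up the ranks that depend on it. That is one reason OS noise and collective-heavy code interact badly at large scale.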

Cray Inc.

Is being able to build the images critically important for any particular reason?

0 Votes

Sunny

Having the tools to build and manage images allows you to more easily adapt/re-image portions or all of your system as needed: for example, in response to dynamic workload requirements, without having to reinstall software from scratch. Images can be created independently.

0 Votes

Joseph B George (JBG)

Yes! I like to think in terms of the application - what resources does the application need, how do we get the results faster, etc. Being able to build images helps us maximize this

0 Votes

Joseph B George (JBG)

Others in the audience - your thoughts?

0 Votes

Sunny

You can build images at any time. Create a library for later use.

0 Votes

Martijn de Vries

it's important to be able to switch between images quickly so that your nodes can be tailored for a particular type of job.