CosmoFlowAI

AI: Exploring the Universe
Supercomputing + AI: Understanding the Universe
Paul Hahn
All of this sounds interesting. Where can I find out more?
Debbie Bard
If you're interested in more technical details, the CosmoFlow paper is available here: https://arxiv.org/ab...
Debbie Bard
Also you can see us at #SuperComputing this year in November!
Victor Lee
You can read about HPC at Intel at @IntelHPC
Debbie Bard
This article is a good layperson's intro to this work: http://www.nersc.gov...
Paul Hahn
Could someone use deep learning without a supercomputer to do this?
Ted Slater
Yes, but it would take much longer to train. By dividing the work between thousands of nodes on Cori, we could train the network to convergence in less than 20 minutes.
Ted Slater
@VictorL14024517 did a hand calculation and found that if we did this on a single node, it could take up to 3 months :-)
Pete Mendygral
Training the CosmoFlow network on a single node would take up to 3 months.
Ted Slater
@PMendygral That kind of time would be a little discouraging. ;-)
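A rough back-of-the-envelope check (assuming near-linear scaling and the 8,192-node runs described later in this chat) shows the two numbers are consistent:

    # Back-of-the-envelope check; assumes near-linear scaling across the
    # 8,192-node runs mentioned later in this chat.
    nodes = 8192
    minutes_at_scale = 20
    node_minutes = nodes * minutes_at_scale       # 163,840 node-minutes of work
    days_on_one_node = node_minutes / (60 * 24)   # ~114 days, i.e. ~3.8 months
    print(f"single-node estimate: ~{days_on_one_node:.0f} days")

Since real scaling is less than perfect, the true single-node figure falls below that ~3.8-month ceiling, in line with the quoted 3 months.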
Debbie Bard
This is important because waiting days (or months!) to train a network is very frustrating for scientists who want to explore their data quickly. With a turnaround time of 20 minutes you can do lots of training runs and explore parameter space quickly.
Victor Lee
Just imagine if you are a cosmologist who needs to wait for the results before you can tweak the network for improvement... CosmoFlow has really changed how cosmologists can work.
Paul Hahn
So without this approach, is it fair to say it just couldn't be done in a reasonable time?
Debbie Bard
That's fair! The authors of the original work (that we based this network on) could not look at more theoretical parameters or use more training data because it took too long to train the network. That's where this collaboration stepped in to help.
Cray Inc.
Q4: Before the use of AI, how were supercomputers used in cosmology?
Debbie Bard
We can’t experiment on the universe (although it would be fun if we could!), so we use supercomputers to see what the universe would look like under different theoretical models by running simulations. In fact, we used such simulations to train the CosmoFlow network.
Pete Mendygral
Supercomputers are also used to analyze massive quantities of data from observatories built to study the Universe.
Cray Inc.
Q14: How do the CosmoFlow results advance the field of cosmology and pave the way for further research?
Debbie Bard
CosmoFlow is a further demonstration of the value of deep learning for capturing the complicated patterns that appear in nature.
Debbie Bard
In this case the network learned the distribution of matter in the universe under different theoretical models, but this also applies to e.g. weather patterns, protein structures, neuron activation in the brain....
Ted Slater
@debbiebard The chemical spaces of drug targets and drug compounds are also hilariously big. Could be a great application space.
Cray Inc.
Q13: What challenges did you experience with CosmoFlow that you’d like to take on for your next steps?
Ted Slater
During the course of our DL work we experienced throughput issues at scale, which we were able to resolve using Cray's DataWarp I/O Accelerator. Going forward, we'd like to further investigate data management for Analytics and AI workloads at scale.
Pete Mendygral
Another challenge we encountered was hyperparameter tuning. Convergence at scale can be challenging in general, and is an area in need of more study. To open up this capability more we'd like to accelerate hyperparameter tuning with automated tools.
Debbie Bard
I'd like to extend this work even further and look at more theoretical models and parameters - which would need even more training data and take even longer to train on a single node :)
Ted Slater
@PMendygral Pete, many people are as yet unfamiliar with hyperparameter optimization. Would you happen to have a link to some good introductory material on HPO?
Pete Mendygral
@tedslater "Hyperparameters" refers to any of the bits of the network and optimizer used to train the network that can be tuned. CosmoFlow uses the stochastic gradient descent algorithm to train the network. We had to manually study how to configure the algorithm for convergence
Pete Mendygral
Hyperparameter optimization is what we did manually, but that's fairly inefficient compared to a tool that could experiment with values like learning rate and optimizer type automatically. With a 20-minute turnaround, automated HPO could really simplify this challenge.
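As a minimal sketch of the kind of automated search described above (the toy model, data, and candidate learning rates below are invented for illustration and are not the CosmoFlow network):

    import numpy as np
    import tensorflow as tf

    # Toy regression data, purely for illustration.
    x = np.random.rand(256, 8).astype("float32")
    y = np.random.rand(256, 1).astype("float32")

    def train_once(lr):
        """Train a small model with SGD at a given learning rate and
        return its final validation loss."""
        model = tf.keras.Sequential([
            tf.keras.layers.Dense(16, activation="relu", input_shape=(8,)),
            tf.keras.layers.Dense(1),
        ])
        model.compile(optimizer=tf.keras.optimizers.SGD(learning_rate=lr),
                      loss="mse")
        hist = model.fit(x, y, epochs=5, validation_split=0.25, verbose=0)
        return hist.history["val_loss"][-1]

    # Try a few candidate learning rates and keep the best one.
    results = {lr: train_once(lr) for lr in (0.001, 0.01, 0.1)}
    best_lr = min(results, key=results.get)
    print("best learning rate:", best_lr)

With a 20-minute training run, a loop like this over many candidates becomes practical, which is the point Pete makes below.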
Pete Mendygral
Yellowfin is one example of such a tool. The paper can be found here: https://arxiv.org/ab....
Cray Inc.
Q8: Why are Xeon Phi processors good for this kind of work?
Victor Lee
Deep learning training and inference are very regular workloads. The 3D convolutions used for CosmoFlow have very high compute density.
Victor Lee
Xeon Phi, like the Xeon processors, has high-performance cores and wide vector units, which work very well for this type of workload. When combined with Intel-optimized performance libraries (such as MKL-DNN), we can deliver very good deep learning performance.
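To give a feel for that compute density, a single 3D convolution over a 128^3 input volume (the cube size and filter count are assumptions used only for illustration, not the published architecture) already involves close to a billion multiply-adds:

    import tensorflow as tf

    # One 128^3 matter-density cube: batch of 1, single channel.
    volume = tf.random.normal([1, 128, 128, 128, 1])
    conv = tf.keras.layers.Conv3D(filters=16, kernel_size=3, activation="relu")
    features = conv(volume)
    print(features.shape)  # (1, 126, 126, 126, 16): ~9e8 multiply-adds in this one layer

Work like this is large, regular, and dense, which is exactly what wide vector units exploit.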
Cray Inc.
Q3: Can you describe the measures that CosmoFlow is estimating?
Debbie Bard
(1/2) Matter is not randomly scattered in the universe - it clumps together due to the attractive force of gravity, but it is also pushed apart since dark energy is causing space to expand.
Debbie Bard
(2/2) CosmoFlow estimates 3 parameters that describe how much matter there is in the universe, and how “clumpy” it is. These are parameters that define the physics of our universe.
Victor Lee
I recall that this is the first time 3 parameters have been estimated simultaneously. Is that right?
Debbie Bard
Yep! Previous work looked at only 2 parameters. They found it too computationally expensive to train the network for 3, so the aim of this work was to make it easy to train a network for more parameters, which requires a lot more training data.
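Concretely, estimating 3 parameters simultaneously means the network ends in a 3-unit regression head, one output per cosmological parameter. A hypothetical sketch, with layer sizes invented for illustration:

    import tensorflow as tf

    model = tf.keras.Sequential([
        tf.keras.layers.Conv3D(16, 3, activation="relu",
                               input_shape=(128, 128, 128, 1)),
        tf.keras.layers.GlobalAveragePooling3D(),
        tf.keras.layers.Dense(3),   # one output per cosmological parameter
    ])
    model.compile(optimizer="sgd", loss="mse")  # regression against the true parameter values

Widening that final layer is cheap; it's the extra training data needed to constrain more parameters that drives up the compute cost.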
Cray Inc.
Q11: What is the CrayPE ML Plugin and how did it enhance distributed deep learning training for CosmoFlow?
Pete Mendygral
(1/4) The CrayPE Machine Learning Plugin, a part of the Cray Urika-XC Analytics and AI suite, improves the scalability and performance of TensorFlow distributed training.
Pete Mendygral
(2/4) This capability is intended for users needing faster time to accuracy, is based on data-parallel DL training, and has a custom communication scheme specifically designed for DL training.
Pete Mendygral
(3/4) TensorFlow users on Urika-XC start with a serial (non-distributed) Python training script, include a few simple lines for the Cray Plugin, and are then able to train across many nodes at very high performance.
Pete Mendygral
(4/4) Using the Cray PE ML Plugin, the team was able to use 8,192 nodes to do fully synchronized data-parallel training, where previous efforts on Cori had encountered significant scaling issues.
Ted Slater
@PMendygral This was the largest-ever deployment of TensorFlow on CPUs!
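The plugin's own API isn't reproduced here, but conceptually, synchronized data-parallel training comes down to each worker computing gradients on its own shard of data and all workers then averaging them each step. A hypothetical mpi4py sketch of that core step:

    import numpy as np
    from mpi4py import MPI

    comm = MPI.COMM_WORLD

    # Stand-in for the gradients this worker computed on its shard of data.
    local_grad = np.random.rand(10)

    # Sum the gradients across all workers, then divide by the worker count.
    avg_grad = np.empty_like(local_grad)
    comm.Allreduce(local_grad, avg_grad, op=MPI.SUM)
    avg_grad /= comm.Get_size()

    # Every rank now holds the identical averaged gradient and applies the
    # same update, keeping all model copies synchronized.

Launched with, e.g., mpirun -n 4 python average_grads.py, every rank ends the step with the same averaged gradient; doing this efficiently across 8,192 nodes is where a custom communication scheme earns its keep.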
Cray Inc.
Q10: Besides building the Cori system, are there any other capabilities that came to bear on the problem?
Pete Mendygral
(1/3) At a system level, a training run like CosmoFlow requires an extremely stable HPC system and scalable, high performance interconnect. Tuning the hyperparameters required many runs, and the system components and supporting software had to be fast and reliable.
Pete Mendygral
(2/3) Running distributed model training isn’t just a compute problem, it’s also a data I/O problem, as each compute node has to read data in parallel.
Pete Mendygral
(3/3) The Cori system at NERSC has both a native Lustre storage system and a “burst buffer” file system comprised of Cray DataWarp I/O accelerator nodes. DataWarp was critical to achieving performance at scale for CosmoFlow.
Debbie Bard
I'll second the usefulness of DataWarp. Training a deep learning network has a heavy read load on a file system, and the SSDs in Cori's Burst Buffer are a good fit for this IO pattern.
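A sketch of that read pattern (the file paths, file count, and worker ID below are hypothetical): each worker opens only its own slice of the training files, so thousands of nodes read from the file system in parallel:

    import tensorflow as tf

    NUM_WORKERS, WORKER_ID = 8192, 0   # would come from the job launcher in practice

    # Hypothetical file names on a DataWarp-backed mount; iterating the
    # dataset would require the files to actually exist.
    files = tf.data.Dataset.from_tensor_slices(
        [f"/datawarp/cosmoflow/train-{i:05d}.tfrecord" for i in range(16384)])
    files = files.shard(NUM_WORKERS, WORKER_ID)          # this worker's slice of the files
    dataset = files.interleave(tf.data.TFRecordDataset,  # read several files concurrently
                               cycle_length=4)
    dataset = dataset.prefetch(tf.data.AUTOTUNE)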
Cray Inc.
Q7: What is the Cori system?
Debbie Bard
(1/2) Cori is #NERSC's flagship supercomputer - a Cray XC40 with 2,388 Intel Xeon nodes and 9,688 Intel Xeon Phi nodes. It has a peak performance of roughly 28 PFlop/s and is currently the 10th most powerful computer on the planet.
Debbie Bard
(2/2) Cori is named for American biochemist Gerty Cori, the first American woman to win a #NobelPrize in science and the first woman to win the prize in Physiology or Medicine.