Yes, but it would take much longer to train. By dividing the work between thousands of nodes on Cori, we could train the network to convergence in less than 20 minutes.
This is important because waiting days (or months!) for a network to train is very frustrating for scientists who want to explore their data quickly. With a 20-minute turnaround you can do lots of training runs and explore the parameter space quickly.
Just imagine being a cosmologist who has to wait for results before they can tweak the network for improvement. CosmoFlow really changes how they can work.
That's fair! The authors of the original work (that we based this network on) could not look at more theoretical parameters or use more training data because it took too long to train the network. That's where this collaboration stepped in to help.
We can’t experiment on the universe (although it would be fun if we could!), so we use supercomputers to see what the universe would look like under different theoretical models by running simulations. In fact, we used such simulations to train the CosmoFlow network.
In this case the network learned the distribution of matter in the universe under different theoretical models, but this also applies to e.g. weather patterns, protein structures, neuron activation in the brain....
During the course of our DL work we experienced throughput issues at scale, which we were able to resolve using Cray's DataWarp I/O Accelerator. Going forward, we'd like to further investigate data management for Analytics and AI workloads at scale.
Another challenge we encountered was hyperparameter tuning. Convergence at scale can be challenging in general, and is an area in need of more study. To open up this capability more we'd like to accelerate hyperparameter tuning with automated tools.
I'd like to extend this work even further and look at more theoretical models and parameters - which would need even more training data and take even longer to train on a single node :)
@PMendygral Pete, many people are as yet unfamiliar with hyperparameter optimization. Would you happen to have a link to some good introductory material on HPO?
@tedslater "Hyperparameters" are all the tunable settings of the network and of the optimizer used to train it. CosmoFlow uses the stochastic gradient descent algorithm to train the network, and we had to study manually how to configure that algorithm for convergence.
Hyperparameter optimization is what we did manually, but that's fairly inefficient compared to a tool that could experiment with values like the learning rate and optimizer type automatically. With a 20-minute turnaround, automated HPO could really simplify this challenge.
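To make that concrete, here is a minimal sketch of automated hyperparameter search (random search over the learning rate), standing in for the manual tuning described above. The "training" here is just SGD on a 1-D toy quadratic so the example is self-contained; in practice each trial would be a full 20-minute CosmoFlow training run, and the search space would cover more than one hyperparameter.

```python
import random

def train(lr, steps=50):
    """Toy training loop: SGD on loss(w) = (w - 3)^2. Returns the final loss."""
    w = 0.0
    for _ in range(steps):
        grad = 2 * (w - 3.0)   # d/dw of (w - 3)^2
        w -= lr * grad         # SGD update
    return (w - 3.0) ** 2

random.seed(0)
# Sample learning rates log-uniformly over [1e-4, 1] and keep the best trial.
trials = [10 ** random.uniform(-4, 0) for _ in range(20)]
best_lr = min(trials, key=train)
print(f"best lr ~ {best_lr:.4f}, final loss {train(best_lr):.2e}")
```

With fast turnaround, even this naive strategy beats hand-tuning because the trials run unattended; smarter schemes (Bayesian optimization, population-based training) just pick the next trial more cleverly.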
Xeon Phi processors, like Xeon processors, have high-performance cores and wide vector units, which work very well for this type of workload. Combined with Intel-optimized performance libraries (such as MKL-DNN), they deliver very good deep learning performance.
(1/2) Matter is not randomly scattered in the universe - it clumps together due to the attractive force of gravity, but it is also pushed apart since dark energy is causing space to expand.
(2/2) CosmoFlow estimates 3 parameters that describe how much matter there is in the universe, and how “clumpy” it is. These are parameters that define the physics of our universe.
Yep! Previous work looked at only 2 parameters. They found it too computationally expensive to train the network for 3, so the aim of this work was to make it easy to train a network for more parameters, which requires a lot more training data.
(1/4) The CrayPE Machine Learning Plugin, a part of the Cray Urika-XC Analytics and AI suite, improves the scalability and performance of TensorFlow distributed training.
(2/4) This capability is intended for users needing faster time to accuracy, is based on data-parallel DL training, and has a custom communication scheme specifically designed for DL training.
(3/4) TensorFlow users on Urika-XC start with a serial (non-distributed) Python training script, include a few simple lines for the Cray Plugin, and are then able to train across many nodes at very high performance.
(4/4) Using the Cray PE ML Plugin, the team was able to use 8,192 nodes to do fully synchronized data-parallel training, where previous efforts on Cori had encountered significant scaling issues.
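The core idea behind fully synchronized data-parallel training can be sketched in a few lines of plain Python. This toy simulates the "nodes" in-process (a single weight fitted to sharded data): each node computes a gradient on its own data shard, the gradients are averaged (the communication step the Cray Plugin optimizes across thousands of nodes; the real plugin's API is not shown here), and every replica applies the identical update so all model copies stay in lockstep.

```python
def local_gradient(w, shard):
    # Gradient of the mean squared error sum((w - x)^2)/n on one node's shard.
    return sum(2 * (w - x) for x in shard) / len(shard)

def sync_step(w, shards, lr=0.1):
    # "Allreduce" stand-in: average the per-node gradients, then every
    # replica applies the same update, keeping all copies of w identical.
    grads = [local_gradient(w, s) for s in shards]
    avg_grad = sum(grads) / len(grads)
    return w - lr * avg_grad

# Four equal-size "nodes", each holding a shard of the data.
shards = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]
w = 0.0
for _ in range(100):
    w = sync_step(w, shards)
print(w)  # converges toward the global data mean, 4.5
```

Because every step uses the average gradient over all shards, the result matches what one node would compute on the full dataset, which is why synchronous data parallelism preserves convergence behavior while splitting the work.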
(1/3) At a system level, a training run like CosmoFlow requires an extremely stable HPC system and scalable, high performance interconnect. Tuning the hyperparameters required many runs, and the system components and supporting software had to be fast and reliable.
(2/3) Running distributed model training isn’t just a compute problem, it’s also a data I/O problem, as each compute node has to read data in parallel.
(3/3) The Cori system at NERSC has both a native Lustre storage system and a “burst buffer” file system comprised of Cray DataWarp I/O accelerator nodes. DataWarp was critical to achieving performance at scale for CosmoFlow.
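A tiny sketch of the parallel-read pattern described above: each compute node (rank) reads only its own slice of the training files, so the burst buffer serves many independent read streams at once rather than one giant sequential read. The file names and sharding scheme here are illustrative, not CosmoFlow's actual layout.

```python
def shard_files(files, rank, world_size):
    # Round-robin assignment: rank r reads files r, r + world_size, ...
    # Every file is read by exactly one rank per epoch.
    return files[rank::world_size]

# Hypothetical training file list, split across 4 ranks.
files = [f"sim_{i:04d}.tfrecord" for i in range(16)]
for rank in range(4):
    print(rank, shard_files(files, rank, 4))
```

Each rank then streams its shard independently, which is exactly the many-concurrent-readers pattern that SSD-backed storage like DataWarp handles well.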
I'll second the usefulness of DataWarp. Training a deep learning network has a heavy read load on a file system, and the SSDs in Cori's Burst Buffer are a good fit for this IO pattern.
(1/2) Cori is #NERSC's flagship supercomputer - a Cray XC40 supercomputer with 2,388 Intel Xeon nodes and 9,688 Xeon Phi nodes. It has a peak performance of roughly 28 PFlop/s, and is currently the 10th most powerful computer on the planet.
(2/2) Cori is named for American biochemist Gerty Cori, the first American woman to win a #NobelPrize and the first woman to win the prize in Physiology or Medicine.