Yes, but it would take much longer to train. By dividing the work between thousands of nodes on Cori, we could train the network to convergence in less than 20 minutes.
This is important because waiting days (or months!) for a network to train is very frustrating for scientists who want to explore their data quickly. With a 20-minute turnaround you can do lots of training runs and explore the parameter space quickly.
Just imagine being a cosmologist who has to wait for results before they can tweak the network for improvement. CosmoFlow really changes how they can work.
That's fair! The authors of the original work (that we based this network on) could not look at more theoretical parameters or use more training data because it took too long to train the network. That's where this collaboration stepped in to help.
We can’t experiment on the universe (although it would be fun if we could!), so we use supercomputers to see what the universe would look like under different theoretical models by running simulations. In fact, we used such simulations to train the CosmoFlow network.
In this case the network learned the distribution of matter in the universe under different theoretical models, but this also applies to e.g. weather patterns, protein structures, neuron activation in the brain....
During the course of our DL work we experienced throughput issues at scale, which we were able to resolve using Cray's DataWarp I/O Accelerator. Going forward, we'd like to further investigate data management for Analytics and AI workloads at scale.
Another challenge we encountered was hyperparameter tuning. Convergence at scale can be challenging in general, and is an area in need of more study. To open up this capability more we'd like to accelerate hyperparameter tuning with automated tools.
I'd like to extend this work even further and look at more theoretical models and parameters - which would need even more training data and take even longer to train on a single node :)
@PMendygral Pete, many people are as yet unfamiliar with hyperparameter optimization. Would you happen to have a link to some good introductory material on HPO?
@tedslater "Hyperparameters" are all the tunable settings of the network and of the optimizer used to train it. CosmoFlow uses the stochastic gradient descent algorithm to train the network, and we had to study manually how to configure that algorithm for convergence.
Hyperparameter optimization is what we did manually, but that's fairly inefficient compared to a tool that could experiment with values like the learning rate and optimizer type automatically. With a 20-minute turnaround, automated HPO could really simplify this challenge.
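To make that concrete, here is a minimal sketch of automated hyperparameter search (random search over the learning rate), standing in for the manual tuning described above. The "training" here is just SGD on a 1-D toy quadratic so the example is self-contained; in practice each trial would be a full 20-minute CosmoFlow training run, and the search space would cover more than one hyperparameter.

```python
import random

def train(lr, steps=50):
    """Toy training loop: SGD on loss(w) = (w - 3)^2. Returns the final loss."""
    w = 0.0
    for _ in range(steps):
        grad = 2 * (w - 3.0)   # d/dw of (w - 3)^2
        w -= lr * grad         # SGD update
    return (w - 3.0) ** 2

random.seed(0)
# Sample learning rates log-uniformly over [1e-4, 1] and keep the best trial.
trials = [10 ** random.uniform(-4, 0) for _ in range(20)]
best_lr = min(trials, key=train)
print(f"best lr ~ {best_lr:.4f}, final loss {train(best_lr):.2e}")
```

With fast turnaround, even this naive strategy beats hand-tuning because the trials run unattended; smarter schemes (Bayesian optimization, population-based training) just pick the next trial more cleverly.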
Xeon Phi processors, like Xeon processors, have high-performance cores and wide vector units, which work very well for this type of workload. Combined with Intel-optimized performance libraries (such as MKL-DNN), they deliver very good deep learning performance.
(1/2) Matter is not randomly scattered in the universe - it clumps together due to the attractive force of gravity, but it is also pushed apart since dark energy is causing space to expand.
(2/2) CosmoFlow estimates 3 parameters that describe how much matter there is in the universe, and how “clumpy” it is. These are parameters that define the physics of our universe.
Yep! Previous work looked at only 2 parameters. They found it too computationally expensive to train the network for 3, so the aim of this work was to make it easy to train a network for more parameters, which requires a lot more training data.
(1/4) The CrayPE Machine Learning Plugin, a part of the Cray Urika-XC Analytics and AI suite, improves the scalability and performance of TensorFlow distributed training.
(2/4) This capability is intended for users needing faster time to accuracy, is based on data-parallel DL training, and has a custom communication scheme specifically designed for DL training.
(3/4) TensorFlow users on Urika-XC start with a serial (non-distributed) Python training script, include a few simple lines for the Cray Plugin, and are then able to train across many nodes at very high performance.
(4/4) Using the Cray PE ML Plugin, the team was able to use 8,192 nodes to do fully synchronized data-parallel training, where previous efforts on Cori had encountered significant scaling issues.
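The core idea behind fully synchronized data-parallel training can be sketched in a few lines of plain Python. This toy simulates the "nodes" in-process (a single weight fitted to sharded data): each node computes a gradient on its own data shard, the gradients are averaged (the communication step the Cray Plugin optimizes across thousands of nodes; the real plugin's API is not shown here), and every replica applies the identical update so all model copies stay in lockstep.

```python
def local_gradient(w, shard):
    # Gradient of the mean squared error sum((w - x)^2)/n on one node's shard.
    return sum(2 * (w - x) for x in shard) / len(shard)

def sync_step(w, shards, lr=0.1):
    # "Allreduce" stand-in: average the per-node gradients, then every
    # replica applies the same update, keeping all copies of w identical.
    grads = [local_gradient(w, s) for s in shards]
    avg_grad = sum(grads) / len(grads)
    return w - lr * avg_grad

# Four equal-size "nodes", each holding a shard of the data.
shards = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]
w = 0.0
for _ in range(100):
    w = sync_step(w, shards)
print(w)  # converges toward the global data mean, 4.5
```

Because every step uses the average gradient over all shards, the result matches what one node would compute on the full dataset, which is why synchronous data parallelism preserves convergence behavior while splitting the work.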
(1/3) At a system level, a training run like CosmoFlow requires an extremely stable HPC system and scalable, high performance interconnect. Tuning the hyperparameters required many runs, and the system components and supporting software had to be fast and reliable.
(2/3) Running distributed model training isn’t just a compute problem, it’s also a data I/O problem, as each compute node has to read data in parallel.
(3/3) The Cori system at NERSC has both a native Lustre storage system and a “burst buffer” file system comprised of Cray DataWarp I/O accelerator nodes. DataWarp was critical to achieving performance at scale for CosmoFlow.
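A tiny sketch of the parallel-read pattern described above: each compute node (rank) reads only its own slice of the training files, so the burst buffer serves many independent read streams at once rather than one giant sequential read. The file names and sharding scheme here are illustrative, not CosmoFlow's actual layout.

```python
def shard_files(files, rank, world_size):
    # Round-robin assignment: rank r reads files r, r + world_size, ...
    # Every file is read by exactly one rank per epoch.
    return files[rank::world_size]

# Hypothetical training file list, split across 4 ranks.
files = [f"sim_{i:04d}.tfrecord" for i in range(16)]
for rank in range(4):
    print(rank, shard_files(files, rank, 4))
```

Each rank then streams its shard independently, which is exactly the many-concurrent-readers pattern that SSD-backed storage like DataWarp handles well.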
I'll second the usefulness of DataWarp. Training a deep learning network has a heavy read load on a file system, and the SSDs in Cori's Burst Buffer are a good fit for this IO pattern.
(1/2) Cori is #NERSC's flagship supercomputer - a Cray XC40 supercomputer with 2,388 Intel Xeon nodes and 9,688 Xeon Phi nodes. It has a peak performance of roughly 28 PFlop/s, and is currently the 10th most powerful computer on the planet.
(2/2) Cori is named for American biochemist Gerty Cori, the first American woman to win a #NobelPrize and the first woman to win the prize in Physiology or Medicine.