GetCrayAI

AI: Make the Right Choices
Find out what you need to know before, during and after deploying your own AI technology solution.
Cray Inc.
Q #1: What are some of the common mistakes IT organizations are making when deciding on infrastructure for an AI PoC/Deployment?
Rangan Sukumar
(1/2) Decisions are made on price and/or hype and short-term budgets. Investments in AI infrastructure often ignore data and model lifecycle management. Infrastructure lock-in based on price or hype locks out future-proofing and user productivity.
Rajesh Anantharaman
Many organizations make the mistake of equating AI only with Deep Learning, and as a result they invest in AI infrastructure that supports only Deep Learning.
Rangan Sukumar
(2/2) Organizations are unable to find facts around AI infrastructure investments (e.g., buying vs. renting, component integration vs. system integration, the value of supported hardware and software vs. doing it yourself).
Rajesh Anantharaman
Although Deep Learning is an important part of AI, it is typically only one part of a broader AI workflow, and only one model choice among a variety of other practical ML models.
Rajesh Anantharaman
Beyond compute, storage is a very important consideration throughout the workflow, as is software that allows you to move through the workflow seamlessly to manage data and build various model types.
Rajesh Anantharaman
The rapidly evolving landscape of hardware and software in AI makes it essential to invest in broad infrastructure that “future-proofs” the hardware, along with a supported and regularly updated software stack.
Rangan Sukumar
@aaronrhoden Mind sharing your experience with "build your own"?
Rajesh Anantharaman
@aaronrhoden That's a great point. This is one of the reasons Cray came up with Accel AI reference configs that have already been architected with best practice AI workflows from our experience.
Aaron A. Rhoden
(1/2) @Rangan_Sukumar Customers have a couple of resources from data warehousing who feel they can pull off AI with some x86 servers with PCI slots for GPUs. It is possible, but the duration from 0 to 100% is so long that the problem, purpose, and value of the solution have changed.
Aaron A. Rhoden
(2/2) Going in without the software stack determined, armed with only some O'Reilly books, is not the way to go. The business will never trust those resources again, and the mere mention of AI, DL, or ML becomes a sore spot.
Rajesh Anantharaman
@aaronrhoden Agreed that for many customers, having an experienced consultant/solution architect can help greatly in their first project win, so they continue to invest in AI/ML/DL.
Rangan Sukumar
@aaronrhoden Excellent point! There are gaps between the value of the business problem and the value of data. Honest AI prototypes without the hype may be the way to avoid another AI winter.
Cray Inc.
Q #4: What are my options if I want to try some of this out?
Aaron A. Rhoden
1. Call Cray. 2. Call Sirius. Shameless plug.
Rajesh Anantharaman
Cray has multiple offerings to help customers get started in AI. We have an Accel AI Lab where you can get access to our hardware and software and try out some of your workloads before you invest in a system.
Rajesh Anantharaman
Cray also has multiple reference configurations for different stages of the AI journey - we call these Accel AI configs. We have a one-node config for you to get started, a prototype config when you have a small team to support for the AI workflow...
Aaron A. Rhoden
...but seriously, I want to get more hands-on with the toolsets like TensorFlow, Caffe, Caffe2, etc., to really know what I am recommending to customers. I want to be able to sit in their seat for a few, so when I sit alongside them I can be of help.
Rajesh Anantharaman
...and a production config when you want to scale out to support a larger team for the AI workflow.
Aaron A. Rhoden
I am also curious about the software contributions unique to Cray in the commercial set.
Rajesh Anantharaman
@aaronrhoden NVIDIA and Cray also offer some DLI workshops where you can get some hands on experience with TF and building some neural network models. You can also take some online courses on coursera or udacity if you really want to get into it :)
Rajesh Anantharaman
@aaronrhoden Cray offers the Urika-CS software suite, a pre-integrated software stack with tools for the AI workflow. We integrated Spark, TensorFlow, and other libraries and made sure they work well together so that customers can save time and get started.
Rajesh Anantharaman
@aaronrhoden We have optimized our software stack to run on heterogeneous compute resources and hybrid storage so you can utilize all the infrastructure you have invested in.
Rajesh Anantharaman
@aaronrhoden We also bring our supercomputing experience into the software stack in order to run distributed training across a large number of heterogeneous nodes with >90% efficiency.
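Conceptually, synchronous data-parallel training of the kind described here splits each batch across workers, computes gradients locally, and averages them before updating the shared model. A pure-Python toy sketch of that idea (not Cray's actual stack; in practice frameworks such as TensorFlow's distributed strategies or Horovod do the all-reduce):

```python
# Toy data-parallel SGD: each "worker" computes the gradient of a
# squared-error loss on its own data shard; gradients are averaged
# and applied to a shared weight, mimicking synchronous training.

def gradient(w, shard):
    # d/dw of mean squared error for the model y_hat = w * x
    return sum(2 * (w * x - y) * x for x, y in shard) / len(shard)

def train_step(w, shards, lr=0.001):
    grads = [gradient(w, s) for s in shards]  # "workers" in parallel
    avg_grad = sum(grads) / len(grads)        # all-reduce (average)
    return w - lr * avg_grad                  # synchronized update

# Synthetic data for y = 3x, split across 4 workers
data = [(x, 3 * x) for x in range(1, 17)]
shards = [data[i::4] for i in range(4)]

w = 0.0
for _ in range(200):
    w = train_step(w, shards)
print(round(w, 2))  # converges to 3.0
```

The scaling efficiency figure above is about how close a real system stays to this ideal as node counts grow and communication for the gradient average starts to dominate.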
Aaron A. Rhoden
@RajeshAnanthara Thanks! I lobbed that one over the net for you. :-)
Aaron A. Rhoden
@RajeshAnanthara HPC re-purposed for model training sounds amazing.
Cray Inc.
Q #2: What is the entire workflow?
Ted Slater
Workflows are many and varied, but they have some common stages. Data acquisition and "clean-up" come first, and can take 60-80% of a data scientist's time.
Ted Slater
In the middle you'll see model development. This can be very computationally intense, depending upon what you're doing. Deep learning, in particular, requires lots of data and a significant amount of compute power to accomplish.
Ted Slater
In the end, you're looking for real insight from all of that work. In deep learning, this is the "inference" phase, where new data come into the model and your hard-earned results come out.
Ted Slater
Workflows can be iterative, where results come out and feed back into an earlier stage of your pipeline to improve results.
Rangan Sukumar
There can be more to the “entire workflow” - the ability to integrate new datasets, associating new labels to new and existing data, conducting A/B tests to make sure the model is current, and triggering auto-tuning jobs to retrain model parameters to new data and behaviors.
Rajesh Anantharaman
AI workflows are also highly iterative in nature – you need to constantly iterate between data prep and model development in order to get a well-performing model.
Rajesh Anantharaman
You also need to iterate constantly across the workflow as models go into production and new data comes in and you need to update the models.
Ted Slater
You can see that workflows, start to finish, can be complex, and your infrastructure (compute, storage, etc.) has to be up for the difficult things as well as the easy things.
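The stages described in this thread (data acquisition, clean-up, model development, inference) can be sketched as a toy pipeline. This is purely illustrative; the functions and the tiny least-squares "model" are hypothetical stand-ins for real workflow stages:

```python
# Toy end-to-end workflow: acquire -> clean -> train -> infer.
# A closed-form least-squares fit of y = w * x stands in for the
# computationally intense model-development stage.

def acquire():
    # Data acquisition: raw records, some of them unusable
    return [(1, 2.1), (2, 3.9), (None, 5.0), (3, 6.2), (4, None)]

def clean(raw):
    # Clean-up: drop incomplete records (often 60-80% of the effort!)
    return [(x, y) for x, y in raw if x is not None and y is not None]

def train(data):
    # Model development: closed-form least squares for y = w * x
    return sum(x * y for x, y in data) / sum(x * x for x, y in data)

def infer(w, x_new):
    # Inference: apply the trained model to new data
    return w * x_new

data = clean(acquire())
w = train(data)
print(infer(w, 5))  # prediction for a new input
```

In a real workflow each of these functions is a whole stage with its own tooling, and results feed back into earlier stages iteratively, as noted above.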
Aaron A. Rhoden
@Cray_Inc Though there is no standard iterative model, are there any best practices beyond data set collection, featurisation, and the model training and testing loops to arrive at the best model?
Aaron A. Rhoden
How do we in tech sales/consulting help customers understand the differences between AI/ML/DL and big data analytics?
Rangan Sukumar
@aaronrhoden The best practices are constantly in flux in the fast-paced AI world. That said, there are tools emerging for automating the workflow itself, with considerations for model versioning, provenance, etc.
Ted Slater
@aaronrhoden Hey, Aaron, it's a good question. At a lower level, like the DL level, it's pretty easy: you've got a neural network with a bunch of hidden layers, or you don't. DL usually needs a lot of (often labeled) data, so it's a Big Data thing most of the time.
Ted Slater
@aaronrhoden Once you get up a level or two, the distinctions become a little less important. What's AI and what's not? It really doesn't matter that much -- we're just trying to get some work done. ;-)
Rajesh Anantharaman
@aaronrhoden The difference between AI/ML/DL and Big Data sometimes tends to be based on tools and ecosystem. Big Data tends to be on the Hadoop ecosystem, versus AI on the TF/Caffe2 ecosystem. Spark, however, moonlights between the two.
Ted Slater
@aaronrhoden And Big Data is often just what you make of it. Some data sets can be small, but incredibly "dense" in some ways (like some graphs, for example). These aren't big, but they'll challenge your compute environment so they need to be treated like Big Data.
Aaron A. Rhoden
@Rangan_Sukumar Variation is expected. I am optimistic that Cray can lead on the automation and potential standards you mention.
Aaron A. Rhoden
@tedslater Thanks, Ted. "We choose to go to the moon...because it is hard."
Rangan Sukumar
@aaronrhoden There is no AI without data and no DL without Big Data. If one is allowed to hand-wave, AI/ML/DL is like a toolbox (magic wand!) to discover patterns in Big Data.
Rajesh Anantharaman
@aaronrhoden Also, another difference is that Big Data tends to be about collecting data and finding insights in it, whereas AI goes beyond that to making sense of and finding patterns in the data.
Aaron A. Rhoden
@Rangan_Sukumar So I have been on track in that outputs from Big Data analytics can be inputs to these other forms (algorithms) of computing. I write this because a number of customers have Hadoop (good and bad implementations). I want to inject AI, etc., to add more value.
Aaron A. Rhoden
@RajeshAnanthara Totally with you on the pattern matching concept. Couple that with actionable steps (pre-programmed or alerting) and we have something that can be used for good.
Rangan Sukumar
@aaronrhoden Well said. It's all about value and extracting it from Big Data.
Tami Wessley
Are there any particular phases of the workflow that are likely to be bottlenecks?
Rajesh Anantharaman
Different workflows have different bottlenecks; it is difficult to generalize across different use cases and workflows.
Rajesh Anantharaman
Two stages of the data science workflow where we see bottlenecks are model training, particularly deep learning model training, and data preparation.
Rajesh Anantharaman
Data acquisition and data preparation could be a bottleneck for use cases where data are hard to acquire and require a lot of cycles to label and prepare.
Rajesh Anantharaman
Some use cases need to train as fast as possible to get the best possible model at the highest possible accuracy, and trying lots of different models with different parameters becomes a huge bottleneck.
Rajesh Anantharaman
Some use cases need a lot of iteration between model development and production deployment, as the business and technical constraints in production dictate different tradeoffs for model development (both in terms of the model and the actual software implementation), which can be a bottleneck.
Rajesh Anantharaman
Therefore, based on your use case and workflow, you need to carefully consider your storage and compute infrastructure as well as your software stack.
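The "trying lots of different models with different parameters" bottleneck mentioned above is essentially a search loop: every parameter combination costs a full training run. A minimal grid-search sketch (the toy model and parameter grid are hypothetical; real searches multiply this by data size and training cost):

```python
import itertools

# Toy grid search: every (lr, epochs) combination means a full training
# run, which is why hyperparameter search multiplies training cost.

def train_and_score(lr, epochs):
    # Stand-in for a real training run: fit w for y = 2x by SGD
    # on a tiny dataset, then return the squared error of w vs. 2.
    data = [(x, 2 * x) for x in range(1, 6)]
    w = 0.0
    for _ in range(epochs):
        for x, y in data:
            w -= lr * 2 * (w * x - y) * x
    return (w - 2.0) ** 2

grid = {"lr": [0.001, 0.01, 0.02], "epochs": [5, 20, 50]}
runs = [dict(zip(grid, v)) for v in itertools.product(*grid.values())]

best = min(runs, key=lambda p: train_and_score(**p))
print(best)  # the combination with the lowest error
```

Here the grid has only 9 cells; real searches over many models and parameters can run to thousands of training jobs, which is exactly where compute infrastructure becomes the constraint.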
Rangan Sukumar
(1/3) Depends a lot on the organization, the types of data being used, etc. Seconding @tedslater and @RajeshAnanthara: for most workflows, data cleaning and preparation is the bottleneck.
Rangan Sukumar
(2/3) For others, it is moving data back and forth from a data source to compute cores, or creating new labels to make DL work for them. Some organizations complain they cannot train models fast enough, and others find creating new neural architectures to be the bottleneck.
Rangan Sukumar
(3/3) We also hear the need for expertise to be able to understand and solve the bottlenecks that are unique to the data and the organization.