AI is driving enterprise investments in data, including in the storage infrastructure that supports distributed data-science pipelines in the hybrid cloud.
We're nearing the end of this #ActionItem CrowdChat on AI & Hybrid Storage. #Wikibon wants to thank the experts who've participated in the discussion. It's been excellent and stimulating.
It's an "it depends" answer. Obviously, you can't beat the speed of light. So, the centralized model of cloud storage doesn't meet that requirement. As you look at solutions from EMC, NetApp, Faction, etc., you see the potential for distributed access.
- No. Storage is lagging behind AI processing advancements. Just look at the number of services and options in AI and related fields at the public providers. Most of the advances in cloud storage are higher IOPS or a bit of lifecycle management.
The simple answer to this question is "No." To get there we need to improve the software side of storage; the likes of the #Hadoop ecosystem have helped us, but a lot needs to be done. We need better storage algorithms (besides better hardware performance).
If the data originates in the cloud, the performance capabilities of storage are generally very good. However, this does require a significant amount of setup time and integration of data sources & compute/GPU resources.
But you do have compute instances that are AI-friendly. Cloud providers are offering super-fast local storage options and high-bandwidth, low-latency network options to create parallel applications. So, it depends on the use case.
Also, IMO compute and network are getting a bigger share of investment dollars, but storage tech isn't far behind. I see many startups trying to tackle this bottleneck.
cloud storage? I guess so. GCP, in particular, is providing well-regarded services. Are merchant storage solutions keeping up with merchant AI silicon? Not without a fair amount of detailed administrative work, but getting closer.
- We need software to help us push more into data lifecycles and future access patterns. One advancement I'd like to see is around user notification. If AI is applied to storage, I should be notified if something interesting comes to light in the future.
When you can get a bunch of high-memory, FPGA/GPU-backed instances on a 10Gbps network, you can create some AI solutions not easily recreated in the enterprise.
@sarbjeetjohal - I would have said storage techs were getting lots of the investment dollars. It seems like we always hear about storage startups getting funding - though the solutions feel like only incremental advancement from the current options.
- Not sure I have a good answer, but it feels like it will be tied to adoption of on-prem public cloud tech like AzureStack and AWS Outposts. Devs will be able to apply the same models locally that they built in public providers.
There are two ways that data at the edge can be used to improve functionality. The first is local data that can improve knowledge of the environment (e.g., a pot-hole in the road). The second is changes to the inference code. Compliance will make the latter rare!
While distributing #AI workloads, the number-one thing to keep in mind is manageability, which IMHO can turn out to be one of the biggest cost factors for at-scale #AI programs.
@sarbjeetjohal - Agreed. I can see folks training ML workloads in AWS and deploying inference models on Outposts in the interim until the edge is ready. Outposts would be an intermediate hub to collect edge data.
Storage is always an underappreciated field of endeavor... always kind of an afterthought. We hear about the "sexy" stuff associated with AI and ML, assuming storage will somehow keep up.
I'm skeptical of many of the storage solutions that are "optimized for AI". The changing state of applications definitely requires some redesign and consideration of storage.
The #AI industry needs to recalibrate (rather than rethink) storage architectures in the era of the hybrid cloud. That starts with using the right mix of storage (performance) for training the models vs. inference workloads, through to performant software platforms & compression...
- Yes. Storage lifecycle and costing are typically separated based on users accessing the data (or not). I think we'll see a distinction between user access, AI access, and offline/archived. AI models still need to run against data I don't think I need (yet).
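That three-way distinction can be sketched as a toy tiering rule. Everything here is hypothetical (the tier names, the access windows), just to illustrate splitting "user access" from "AI access" from "archive" based on who touched the data last:

```python
from datetime import datetime, timedelta

# Hypothetical lifecycle tiers, per the user/AI/archive split above.
USER_HOT = "user-access"
AI_ONLY = "ai-access"
ARCHIVE = "archived"

def classify(last_user_access: datetime, last_ai_access: datetime,
             now: datetime, user_window_days: int = 30,
             ai_window_days: int = 180) -> str:
    """Pick a storage tier from recent access patterns (illustrative windows)."""
    if now - last_user_access <= timedelta(days=user_window_days):
        return USER_HOT   # users still touch it
    if now - last_ai_access <= timedelta(days=ai_window_days):
        return AI_ONLY    # only models read it now; keep it AI-reachable
    return ARCHIVE        # cold for both users and models

now = datetime(2019, 6, 1)
print(classify(datetime(2019, 5, 20), datetime(2019, 5, 30), now))  # user-access
print(classify(datetime(2018, 1, 1), datetime(2019, 4, 1), now))    # ai-access
print(classify(datetime(2017, 1, 1), datetime(2018, 1, 1), now))    # archived
```

The key point in the message above is the middle tier: data no user has asked for in months can still be "hot" from the model's point of view, so it can't simply follow user-driven lifecycle rules into the archive.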
@joemckendrick I think data lakes are sexy, and they're key components of the AI development and operations pipeline. They're nothing without mass storage.
If the AI industry believes that all data will reside in centralized storage pools under centralized control, yes. Lots of questions regarding how and how much data will get moved. Most likely: Much derivative AI modeling will be distributed, with big implications.
@sarbjeetjohal I agree. It's a matter of fitting the storage tech to the specific workloads in each tier of the AI pipeline deployed in public v. private clouds in hybrid architectures.
I think the cloud providers will keep up with storage requirements. It may be a tall order for enterprises onsite. But the cloud providers will always be in a race at the back-end to shore up speed and performance
I'm no AI expert, but one obvious challenge is I/O. Centralized storage systems introduce latency in getting data close to the inference and modeling engines.
- Traditional storage assumes that we know enough about the data to accurately categorize it. I think that's part of why we see an explosion in unstructured data. Data will be more in the form of media (pictures, sounds, etc.) than traditional blocks.
For inference workloads at the Edge (many different types), the emphasis is on real-time, in-context support of inference code running in a mesh of nodes. DRAM together with flash (NVDIMMs) will be an important technology at the Edge.
@ballen_clt Really? I thought that "traditional" in storage meant data structuring (relational, columnar, file, etc.), but not necessarily any deeper semantic understanding of the data.
Programmability of storage will enable policy-based storage allocation, which can further help #AI workloads. Not all #AI-related workloads are equally demanding on storage.
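A minimal sketch of what such a policy could look like. The workload profiles, storage-class names, and IOPS figures are all made up for illustration, not any vendor's actual tiers; the point is just that training, inference, and batch work can be routed to different storage classes by policy rather than by hand:

```python
# Hypothetical policy table mapping AI workload profiles to storage classes.
POLICY = {
    "training":  {"class": "nvme-flash",  "min_iops": 500_000, "replicas": 1},
    "inference": {"class": "local-ssd",   "min_iops": 100_000, "replicas": 2},
    "batch-etl": {"class": "object-std",  "min_iops": 5_000,   "replicas": 3},
    "archive":   {"class": "object-cold", "min_iops": 100,     "replicas": 3},
}

def allocate(workload: str) -> dict:
    """Return the storage policy for a workload, defaulting to the cheapest tier."""
    return POLICY.get(workload, POLICY["archive"])

print(allocate("training")["class"])   # nvme-flash
print(allocate("unknown")["class"])    # object-cold
```

Training gets the fastest (and least replicated) scratch tier, while inference trades some speed for redundancy, which matches the observation that not all #AI workloads stress storage the same way.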
For development workloads the emphasis is on large amounts of data (much of which will use HDD) and smaller amounts of active data held in flash. The optimum way of holding this data is shared mode with snapshots. Distributed data should have code moving to the data.
@CTOAdvisor AI can be as centralized or distributed as you need it to be. But it's often modeled, trained, and served from highly centralized storage/compute platforms.