#LocalData

Managing Billions of Files
What kinds of data management & backup challenges creep up when your NAS gets to billions of files?

Spawning a new hybrid cloud
Beyond just moving legacy apps to the public cloud, what about bringing cloud services to #LocalData?

Tiering to Public Cloud
How to think about and manage tiering from the private data center to public cloud.
John Furrier
Q5. What does your current lifecycle look like between snapshots, archives, and longer term backups? Do you delete the data when it gets old?
John Furrier
Who deletes data is another question. Serious question?
Sagi Brody
This is based on the governing regulatory requirements of the user and vertical. Some, like #HIPAA, don't have an exact de facto standard and are open to interpretation.
Sagi Brody
People who get billed per GB :)
jeff dinisco
long-term snaps are still pricey and don't protect against a variety of scenarios, leaving traditional backup as a necessary evil in most enterprises
Sagi Brody
I believe surveillance and body cam recordings are only going to push out the storage retention tail required.
Bryan Champagne
Compliance and policy determine deletion. We see people struggle with actually adhering to deletion policies unless they have it automated
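A minimal sketch of the kind of automated enforcement Bryan describes, assuming a POSIX tree and a retention window keyed off mtime; the 7-year window and archive path are illustrative, not anyone's actual policy:

```python
import os
import time

RETENTION_SECONDS = 7 * 365 * 24 * 3600  # assumed 7-year retention window
ARCHIVE_ROOT = "/mnt/archive"            # hypothetical archive mount point

def purge_expired(root: str, retention: int, dry_run: bool = True) -> None:
    """Delete (or list, in dry-run mode) files whose mtime has aged past retention."""
    cutoff = time.time() - retention
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                if os.stat(path).st_mtime < cutoff:
                    print(("would delete" if dry_run else "deleting"), path)
                    if not dry_run:
                        os.remove(path)
            except OSError:
                continue  # file vanished or is unreadable; skip it

if __name__ == "__main__":
    purge_expired(ARCHIVE_ROOT, RETENTION_SECONDS)  # dry run by default
```

Running something like this on a schedule is what turns the written policy into the automated enforcement people actually stick to.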
Dave Vellante
I've never seen a successful, sustainable example of deleting data/reclaiming wasted space
Chris Dagdigian
Super brittle at the moment. No consistent patterns. Some scientific storage is too big for cost-effective full-lifecycle mgmt, so many have just 1 primary and 1 last-resort-backup-and-archive copy of their data.
jeff dinisco
@furrier totally agree, I meet very few customers willing to push the delete button, expiration maybe, active delete, don't see it
Nick Kirsch
Ironically, lots of machine-generated data /could/ be deleted and re-generated, but the software gets in the way.
Sagi Brody
What about real-time stream data analytics? We are also seeing this reduce the requirement to store data because of its real-time nature.
Dave Vellante
@webairsagi love the streaming example - it's almost a new workload - batch, interactive, streaming
Christian Smith
Common retention policies - 1 week on primary (snapshots), 3+ months on secondary. Never delete.
Nick Kirsch
@webairsagi It seems that many of these streaming workloads will also be related to machine-learning and anything with ML will be data retention heavy...
Dave Vellante
@webairsagi Streaming App Value Chain: Ingest -> Explore -> Process (Predict) -> Serve -> Persist?
Dave Vellante
@nkirsch I think about a smart meter stream with data that doesn't change - don't persist - vs weather data which I might persist
Nick Kirsch
Isn't there some value in that smart meter stream? Anything with potential predictive benefit, coupled with low cost of storage, seems likely to be persisted.
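One way to picture the smart-meter example is a filter on ingest that persists a reading only when it differs from the meter's last value. A minimal sketch, with the record shape and change threshold as assumptions:

```python
from typing import Dict, Iterable, Iterator, Tuple

Reading = Tuple[str, float]  # assumed record shape: (meter_id, value)

def changed_only(stream: Iterable[Reading], epsilon: float = 0.0) -> Iterator[Reading]:
    """Yield only readings that differ from the meter's last seen value."""
    last: Dict[str, float] = {}
    for meter_id, value in stream:
        prev = last.get(meter_id)
        if prev is None or abs(value - prev) > epsilon:
            last[meter_id] = value
            yield meter_id, value  # only these would be persisted

# Three identical readings collapse to one persisted record.
sample = [("m1", 5.0), ("m1", 5.0), ("m1", 5.0), ("m1", 7.2)]
print(list(changed_only(sample)))  # [('m1', 5.0), ('m1', 7.2)]
```

Whether the discarded readings had predictive value, per Nick's point, is exactly the judgment call such a filter forces you to make up front.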
John Furrier
Brent M. Piatti @BrentPiatti said:
Truth! Problem is most organizations see data storage as a cost center rather than a potential gold mine.
John Furrier
Q3: How do you know what you have? Is your data useful? Do people still access it after 3 to 5 years?
Stuart Miniman
how much are people leveraging vs simply saving for compliance?
jeff dinisco
the old adage that it's cheaper to save than to delete still holds true with < 1 PB data sets
Bryan Champagne
This is what we discuss daily with our customers right now. Access over time is a funny thing: the data will be there, but whether it can still be read is another question.
Sagi Brody
Enforce chargeback rules inside the organization and run your IaaS or internal infrastructure as a multi-tenant system. This forces accountability/ownership of data onto stakeholders, not to mention makes you scalable and is good for security.
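A rough sketch of the chargeback idea: meter capacity per tenant and bill per GB, so every byte has an owner. The directory layout and $/GB rate are illustrative assumptions:

```python
import os
from typing import Dict

RATE_PER_GB = 0.03            # assumed monthly $/GB, purely illustrative
TENANT_ROOT = "/mnt/tenants"  # hypothetical layout: one subdirectory per tenant

def dir_bytes(path: str) -> int:
    """Total bytes under a directory tree."""
    total = 0
    for dirpath, _dirs, files in os.walk(path):
        for name in files:
            try:
                total += os.stat(os.path.join(dirpath, name)).st_size
            except OSError:
                continue
    return total

def chargeback(root: str) -> Dict[str, float]:
    """Return {tenant: monthly charge} for each top-level tenant directory."""
    bills = {}
    for tenant in sorted(os.listdir(root)):
        path = os.path.join(root, tenant)
        if os.path.isdir(path):
            bills[tenant] = dir_bytes(path) / 1e9 * RATE_PER_GB
    return bills

if __name__ == "__main__":
    for tenant, charge in chargeback(TENANT_ROOT).items():
        print(f"{tenant}: ${charge:,.2f}/month")
```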
Sagi Brody
Oh yeah, and analytics help. You're seeing lots of storage platforms building analytics into the storage platform itself now.
Nick Kirsch
I continue to see many people implement side-band metadata stores, using tree walks, ElasticSearch, proprietary MAMs, etc., to try to keep track of everything.
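A sketch of one of those side-band patterns: a tree walk that pushes basic POSIX metadata into Elasticsearch for later querying. The endpoint, index name, and the elasticsearch-py 8.x client usage are assumptions:

```python
import os

from elasticsearch import Elasticsearch  # pip install elasticsearch

es = Elasticsearch("http://localhost:9200")  # hypothetical endpoint

def index_tree(root: str, index: str = "file-metadata") -> None:
    """Walk root and index size/times/owner for every file."""
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            path = os.path.join(dirpath, name)
            try:
                st = os.stat(path)
            except OSError:
                continue  # skip files that vanish mid-walk
            es.index(index=index, document={
                "path": path,
                "size": st.st_size,
                "mtime": st.st_mtime,
                "atime": st.st_atime,
                "owner_uid": st.st_uid,
            })
```

The catch, as the thread notes below, is the walk itself: at billions of files the crawl becomes the bottleneck.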
Chris Dagdigian
In pharma/biotech the results are pretty consistent: nobody touches 98% of data 30 days after it was created. Lack of data awareness is compounded by the fact that human data curators are more expensive than just adding capacity to a tier and punting.
jeff dinisco
traditional fs attributes, e.g. access times, aren't incredibly useful in most cases; tagging on the way in is the key
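Tagging on the way in might look like attaching user metadata at write time, for example with S3 object metadata via boto3; the bucket, key, and tag values here are placeholders:

```python
import boto3  # pip install boto3

s3 = boto3.client("s3")

s3.put_object(
    Bucket="example-ingest-bucket",         # hypothetical bucket
    Key="instruments/run-0001/output.dat",  # hypothetical key
    Body=b"...",                            # payload elided
    Metadata={                              # stored as x-amz-meta-* headers
        "project": "alpha",
        "owner": "jsmith",
        "retention-class": "3-months",
    },
)
```

Those tags travel with the object, so lifecycle and search decisions don't have to be reverse-engineered from access times later.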
Nick Kirsch
In an online/nearline archive, data is often on a different file system or mount point - users are trained or led on how to recover data themselves. If secondary storage is tape, that typically requires a ticket.
Chris Dagdigian
The funny thing is that storage costs are higher than some lab/science experiments, so in biotech/pharma it is common to DELETE primary data because it is cheaper to rerun the experiment than to manage data through a long lifecycle.
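A back-of-envelope version of that trade-off, with every number a made-up assumption rather than a quoted price:

```python
# Keep the data for N years vs. delete it and rerun the experiment if needed.
dataset_tb = 50            # assumed dataset size in TB
store_per_tb_month = 20.0  # assumed all-in $/TB-month for primary storage
years = 5
rerun_cost = 8_000.0       # assumed cost to repeat the experiment
p_need_again = 0.1         # assumed probability the data is ever reread

keep_cost = dataset_tb * store_per_tb_month * 12 * years
delete_cost = p_need_again * rerun_cost

print(f"keep for {years} years: ${keep_cost:,.0f}")   # $60,000
print(f"delete, maybe rerun:   ${delete_cost:,.0f}")  # $800
```

Under those assumed numbers the delete-and-rerun option wins by nearly two orders of magnitude, which is the economics Chris is pointing at.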
John Furrier
If someone doesn't know where their data is, then it's trouble time
jeff dinisco
@nkirsch totally agree, wondering at what scale tree walks just become impossible though, thoughts?
Chris Dagdigian
Lots of good filesystem crawlers out there, like https://sourceforge.... etc.
Nick Kirsch
@dinisco tree walks are like backup windows - they almost always take too long and have too much impact on the primary system...
John Furrier
Q1: What technologies are storing the unstructured data you are seeing today? What is the relative breakdown across SAN, NAS, on-premises object, and public cloud that you are exposed to?
Bryan Champagne
We have been seeing a combination of NAS, on-prem object, and public cloud, depending on where the customer is in their adoption cycle of cloud technologies
Sagi Brody
Because of the requirement for HA virtualization, SAN is still heavily used and very important. However, the majority of 'large' data sets are on file (NFS).
jeff dinisco
NAS is definitely on top in the enterprise, although more and more are trying to get to object, and to do so without a gateway - just need the apps to catch up
Stuart Miniman
highly fragmented throughout the industry - Scale-out NAS, object, public cloud, and all of the secondary storage options
Sagi Brody
Also, people are being 'pushed' into using specific technologies to support feature sets. For example, if you want to use #Zerto for #DRaaS, then you need to consume your storage as #SAN to support the application consistency and replication.
Bryan Champagne
@dinisco I agree. The apps have definitely been a hold up
Nick Kirsch
In the hyperscale space, it's all object-y (which includes HDFS, IMHO.)
Chris Dagdigian
life science uses peta-capable scale-out NAS for unstructured data by default. 90% NAS with a mixture of parallel FS and object making up the rest
John Furrier
data storage is now a strategic decision, so how it's operationalized by IT matters big time
John Furrier
I hear that from CxOs, but there aren't many use cases or much of a track record, so this is a hot area
Bryan Champagne
Getting search and an understanding of what the data is has been a key initiative for many of our customers
Sagi Brody
The key to object is S3 compliance from an application perspective. It's great to see CloudBerry, Rubrik, and other backup products supporting S3 natively.
Sagi Brody
The medical field is starting to adopt native S3 as well
Chris Dagdigian
@webairsagi 100% this. No S3-compatible API, no deal.
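In practice, "S3-compatible API" means the same client code works against AWS or any on-prem object store just by swapping the endpoint. A minimal boto3 sketch, with the endpoint and credentials as placeholders:

```python
import boto3  # pip install boto3

s3 = boto3.client(
    "s3",
    endpoint_url="https://objects.example.internal",  # hypothetical on-prem store
    aws_access_key_id="EXAMPLE_KEY",                  # placeholder credentials
    aws_secret_access_key="EXAMPLE_SECRET",
)

# Identical calls work against AWS S3 or the on-prem store.
for obj in s3.list_objects_v2(Bucket="backups").get("Contents", []):
    print(obj["Key"], obj["Size"])
```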
jeff dinisco
agree with @nkirsch - these guys are consuming the most and have built their apps from the ground up to leverage object, the most scalable/manageable option. Most enterprises are envious.
Sagi Brody
It's a chicken-and-egg situation when it comes to S3 and the enterprise. If you're building a DevOps net-new app, easy. But if you're talking to the enterprise, you're looking at a gateway appliance in front of the object store for a while.
Stuart Miniman
"still NAS but objects are getting real" via @DeepStorageNet