#LocalData

Managing Billions of Files
What kinds of data management & backup challenges creep up when your NAS gets to billions of files?

Spawning a new hybrid cloud
Beyond just moving legacy apps to the public cloud, what about bringing cloud services to #LocalData?

Tiering to Public Cloud
How to think about and manage tiering from the private data center to public cloud.
John Furrier
Q5. What does your current lifecycle look like between snapshots, archives, and longer term backups? Do you delete the data when it gets old?
John Furrier
Who deletes data is another question. Serious question?
Sagi Brody
This is based on the governing regulatory requirements of the user and vertical. Some, like #HIPAA, don't have an exact de facto standard and are open to interpretation.
Sagi Brody
People who get billed per GB :)
jeff dinisco
long-term snaps are still pricey and don't protect against a variety of scenarios, leaving traditional backup as a necessary evil in most enterprises
Sagi Brody
I believe surveillance and body cam recordings are only going to push out the storage retention tail required.
Bryan Champagne
Compliance and policy determine deletion. We see people struggle with actually adhering to deletion policies unless they have it automated
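A minimal sketch of the kind of automated enforcement Bryan describes, assuming a POSIX tree and a retention window keyed off mtime; the 7-year window and archive path are illustrative, not anyone's actual policy:

```python
import os
import time

RETENTION_SECONDS = 7 * 365 * 24 * 3600  # assumed 7-year retention window
ARCHIVE_ROOT = "/mnt/archive"            # hypothetical archive mount point

def purge_expired(root: str, retention: int, dry_run: bool = True) -> None:
    """Delete (or list, in dry-run mode) files whose mtime has aged past retention."""
    cutoff = time.time() - retention
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                if os.stat(path).st_mtime < cutoff:
                    print(("would delete" if dry_run else "deleting"), path)
                    if not dry_run:
                        os.remove(path)
            except OSError:
                continue  # file vanished or is unreadable; skip it

if __name__ == "__main__":
    purge_expired(ARCHIVE_ROOT, RETENTION_SECONDS)  # dry run by default
```

Running something like this on a schedule is what turns the written policy into the automated enforcement people actually stick to.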
Dave Vellante
I've never seen a successful, sustainable example of deleting data/reclaiming wasted space
Chris Dagdigian
Super brittle at the moment. No consistent patterns. Some scientific storage is too big for cost-effective full-lifecycle mgmt, so many have just 1 primary and 1 last-resort-backup-and-archive copy of their data.
jeff dinisco
@furrier totally agree, I meet very few customers willing to push the delete button, expiration maybe, active delete, don't see it
Nick Kirsch
Ironically, lots of machine-generated data /could/ be deleted and re-generated, but the software gets in the way.
Sagi Brody
What about real-time stream data analytics? We are also seeing this reduce the requirement to store data because of its real-time nature.
Dave Vellante
@webairsagi love the streaming example - it's almost a new workload - batch, interactive, streaming
Christian Smith
Common retention policies - 1 week on primary (snapshots), 3+ months on secondary. Never delete.
Nick Kirsch
@webairsagi It seems that many of these streaming workloads will also be related to machine-learning and anything with ML will be data retention heavy...
Dave Vellante
@webairsagi Streaming App Value Chain: Ingest -> Explore -> Process (Predict) -> Serve -> Persist?
Dave Vellante
@nkirsch I think about a smart meter stream with data that doesn't change - don't persist - vs weather data which I might persist
Nick Kirsch
Isn't there some value in that smart meter stream? Anything with potential predictive benefit, coupled with low cost of storage, seems likely to be persisted.
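One way to picture the smart-meter example is a filter on ingest that persists a reading only when it differs from the meter's last value. A minimal sketch, with the record shape and change threshold as assumptions:

```python
from typing import Dict, Iterable, Iterator, Tuple

Reading = Tuple[str, float]  # assumed record shape: (meter_id, value)

def changed_only(stream: Iterable[Reading], epsilon: float = 0.0) -> Iterator[Reading]:
    """Yield only readings that differ from the meter's last seen value."""
    last: Dict[str, float] = {}
    for meter_id, value in stream:
        prev = last.get(meter_id)
        if prev is None or abs(value - prev) > epsilon:
            last[meter_id] = value
            yield meter_id, value  # only these would be persisted

# Three identical readings collapse to one persisted record.
sample = [("m1", 5.0), ("m1", 5.0), ("m1", 5.0), ("m1", 7.2)]
print(list(changed_only(sample)))  # [('m1', 5.0), ('m1', 7.2)]
```

Whether the discarded readings had predictive value, per Nick's point, is exactly the judgment call such a filter forces you to make up front.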
John Furrier
Brent M. Piatti @BrentPiatti said:
Truth! Problem is most organizations see data storage as a cost center rather than a potential gold mine.
John Furrier
Q3: How do you know what you have? Is your data useful? Do people still access it after 3 to 5 years?
Stuart Miniman
how much are people leveraging vs simply saving for compliance?
jeff dinisco
the old adage that it's cheaper to save than to delete still holds true with < 1 PB data sets
Bryan Champagne
This is what we discuss daily with our customers right now. Access over time is a funny thing: the data will be there, but whether it can still be read is another question.
Sagi Brody
Enforce chargeback rules inside the organization and run your IaaS or internal infrastructure as a multi-tenant system. This forces accountability/ownership of data onto stakeholders, not to mention makes you scalable and is good for security.
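A rough sketch of the chargeback idea: meter capacity per tenant and bill per GB, so every byte has an owner. The directory layout and $/GB rate are illustrative assumptions:

```python
import os
from typing import Dict

RATE_PER_GB = 0.03            # assumed monthly $/GB, purely illustrative
TENANT_ROOT = "/mnt/tenants"  # hypothetical layout: one subdirectory per tenant

def dir_bytes(path: str) -> int:
    """Total bytes under a directory tree."""
    total = 0
    for dirpath, _dirs, files in os.walk(path):
        for name in files:
            try:
                total += os.stat(os.path.join(dirpath, name)).st_size
            except OSError:
                continue
    return total

def chargeback(root: str) -> Dict[str, float]:
    """Return {tenant: monthly charge} for each top-level tenant directory."""
    bills = {}
    for tenant in sorted(os.listdir(root)):
        path = os.path.join(root, tenant)
        if os.path.isdir(path):
            bills[tenant] = dir_bytes(path) / 1e9 * RATE_PER_GB
    return bills

if __name__ == "__main__":
    for tenant, charge in chargeback(TENANT_ROOT).items():
        print(f"{tenant}: ${charge:,.2f}/month")
```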
Sagi Brody
Oh yeah, and analytics help. You're seeing lots of storage platforms building analytics into the storage platform itself now.
Nick Kirsch
I continue to see many people implement side-band metadata stores, using tree walks, ElasticSearch, proprietary MAMs, etc., to try to keep track of everything.
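A sketch of one of those side-band patterns: a tree walk that pushes basic POSIX metadata into Elasticsearch for later querying. The endpoint, index name, and the elasticsearch-py 8.x client usage are assumptions:

```python
import os

from elasticsearch import Elasticsearch  # pip install elasticsearch

es = Elasticsearch("http://localhost:9200")  # hypothetical endpoint

def index_tree(root: str, index: str = "file-metadata") -> None:
    """Walk root and index size/times/owner for every file."""
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            path = os.path.join(dirpath, name)
            try:
                st = os.stat(path)
            except OSError:
                continue  # skip files that vanish mid-walk
            es.index(index=index, document={
                "path": path,
                "size": st.st_size,
                "mtime": st.st_mtime,
                "atime": st.st_atime,
                "owner_uid": st.st_uid,
            })
```

The catch, as the thread notes below, is the walk itself: at billions of files the crawl becomes the bottleneck.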
Chris Dagdigian
In pharma/biotech the results are pretty consistent: nobody touches 98% of data 30 days after it was created. Lack of data awareness is compounded by the fact that human data curators are more expensive than just adding capacity to a tier and punting.
jeff dinisco
traditional fs attributes, e.g. access times, aren't incredibly useful in most cases; tagging on the way in is the key
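Tagging on the way in might look like attaching user metadata at write time, for example with S3 object metadata via boto3; the bucket, key, and tag values here are placeholders:

```python
import boto3  # pip install boto3

s3 = boto3.client("s3")

s3.put_object(
    Bucket="example-ingest-bucket",         # hypothetical bucket
    Key="instruments/run-0001/output.dat",  # hypothetical key
    Body=b"...",                            # payload elided
    Metadata={                              # stored as x-amz-meta-* headers
        "project": "alpha",
        "owner": "jsmith",
        "retention-class": "3-months",
    },
)
```

Those tags travel with the object, so lifecycle and search decisions don't have to be reverse-engineered from access times later.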
Nick Kirsch
In an online/nearline archive, data is often on a different file system or mount point - users are trained or led on how to recover data themselves. If secondary storage is tape, that typically requires a ticket.
Chris Dagdigian
The funny thing is that storage costs are higher than some lab/science experiments, so in biotech/pharma it is common to DELETE primary data because it is cheaper to rerun the experiment than to manage data through a long lifecycle.
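A back-of-envelope version of that trade-off, with every number a made-up assumption rather than a quoted price:

```python
# Keep the data for N years vs. delete it and rerun the experiment if needed.
dataset_tb = 50            # assumed dataset size in TB
store_per_tb_month = 20.0  # assumed all-in $/TB-month for primary storage
years = 5
rerun_cost = 8_000.0       # assumed cost to repeat the experiment
p_need_again = 0.1         # assumed probability the data is ever reread

keep_cost = dataset_tb * store_per_tb_month * 12 * years
delete_cost = p_need_again * rerun_cost

print(f"keep for {years} years: ${keep_cost:,.0f}")   # $60,000
print(f"delete, maybe rerun:   ${delete_cost:,.0f}")  # $800
```

Under those assumed numbers the delete-and-rerun option wins by nearly two orders of magnitude, which is the economics Chris is pointing at.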
John Furrier
If someone doesn't know where their data is, then it's trouble time
jeff dinisco
@nkirsch totally agree, wondering at what scale tree walks just become impossible though, thoughts?
Chris Dagdigian
Lots of good filesystem crawlers out there, like https://sourceforge.... etc.
Nick Kirsch
@dinisco tree walks are like backup windows - they almost always take too long and have too much impact on the primary system...
John Furrier
Q1: What technologies are storing the unstructured data you are seeing today? What is the relative breakdown across SAN, NAS, on-premises object, and public cloud that you are exposed to?
Bryan Champagne
We have been seeing a combination of NAS, on-prem object, and public cloud, depending on where the customer is in their adoption cycle of cloud technologies
Sagi Brody
Because of the requirement for HA virtualization, SAN is still heavily used and very important. However, the majority of 'large' data sets are on file (NFS).
jeff dinisco
NAS is definitely on top in the enterprise, although more and more are trying to get to object, and to do so without a gateway - just need the apps to catch up
Stuart Miniman
highly fragmented throughout the industry - Scale-out NAS, object, public cloud, and all of the secondary storage options
Sagi Brody
Also, people are being 'pushed' into using specific technologies to support feature sets. For example, if you want to use #Zerto for #DRaaS, then you need to consume your storage as #SAN to support the application consistency and replication.
Bryan Champagne
@dinisco I agree. The apps have definitely been a hold up
Nick Kirsch
In the hyperscale space, it's all object-y (which includes HDFS, IMHO.)
Chris Dagdigian
life science uses peta-capable scale-out NAS for unstructured data by default. 90% NAS with a mixture of parallel FS and object making up the rest
John Furrier
data storage is now a strategic decision, so how it's operationalized by IT matters big time
John Furrier
I hear that from CxOs, but there aren't many use cases or much of a track record, so this is a hot area
Bryan Champagne
Getting search and an understanding of what the data is has been a key initiative for many of our customers
Sagi Brody
The key to object is S3 compliance from an application perspective. It's great to see CloudBerry, Rubrik, and other backup products supporting S3 natively.
Sagi Brody
The medical field is starting to adopt native S3 as well
Chris Dagdigian
@webairsagi 100% this. No S3-compatible API, no deal.
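In practice, "S3-compatible API" means the same client code works against AWS or any on-prem object store just by swapping the endpoint. A minimal boto3 sketch, with the endpoint and credentials as placeholders:

```python
import boto3  # pip install boto3

s3 = boto3.client(
    "s3",
    endpoint_url="https://objects.example.internal",  # hypothetical on-prem store
    aws_access_key_id="EXAMPLE_KEY",                  # placeholder credentials
    aws_secret_access_key="EXAMPLE_SECRET",
)

# Identical calls work against AWS S3 or the on-prem store.
for obj in s3.list_objects_v2(Bucket="backups").get("Contents", []):
    print(obj["Key"], obj["Size"])
```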
jeff dinisco
agree with @nkirsch - these guys are consuming the most and have built their apps from the ground up to leverage object, the most scalable/manageable option. Most enterprises are envious.
Sagi Brody
It's a chicken-and-egg situation when it comes to S3 and the enterprise. If you're building a DevOps net-new app, easy. But if you're talking to the enterprise, you're looking at a gateway appliance in front of the object store for a while.
Stuart Miniman
"still NAS but objects are getting real" via @DeepStorageNet