Sometimes overlooked in the Data Lake conversation: archive data from legacy and retired systems of record. That data is still very useful, especially for long-term trends - but how do you place it in the Data Lake, and how do you control it properly?
@mcauth Data retention is a key issue, especially for the scads of infrastructure data too low-level (or not relevant enough to compliance) to merit long-term retention. There's just so much log data with marginal retention value.
Look, the proper place to USE that data is still a data warehouse, and they have overcome many of their drawbacks. I say, cloud DW, not data lake if you think it will be used.
@NeilRaden but surely Cloud DW, Data Lake, EDW are all part of the Enterprise Data Fabric - should we not worry where the data is held as long as it can be used across the Enterprise?
In other words, how much friction is there for people trying to find the data and analytics: Do they need to know a query language? Can they easily search? How long do they have to wait?
I'll parse this along the boundaries of data sources: Machine data (logs, snmp, etc.) Agent data, synthetic data, and wire data. Most are quite easy to get at, it's harder to wrangle it into value.
Varies *wildly* with tooling and instrumentation. For some I've spoken with it's hopeless in their current deployment. For others it's seconds to visibility and insight. It's all about surfacing insights rapidly, automatically.
@dorkninja Great point! But as IT resources become more core to business behavior, a more diverse user community will demand simple ways to find, collaborate on, and act on answers.
@yaronhaviv Agree. There is a "half-life" to data value and it differs based on the analytics use case. With incident response and downtime, the half-life of data value is shorter, but for long-term capacity planning the half-life is longer.
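The "half-life" framing above can be sketched as simple exponential decay. This is a hypothetical illustration, not anything from the thread: the function name, the specific half-lives, and the dollar-like values are all made up for the example.

```python
# Illustrative sketch of the "half-life of data value" idea:
# each analytics use case assigns its own half-life, and a record's
# value decays exponentially as it ages. Numbers are invented.

def data_value(initial_value: float, age_days: float, half_life_days: float) -> float:
    """Value remaining after age_days, given a use-case-specific half-life."""
    return initial_value * 0.5 ** (age_days / half_life_days)

# Incident response: short half-life - week-old telemetry is nearly worthless.
incident = data_value(100.0, age_days=7, half_life_days=1)     # ~0.78
# Capacity planning: long half-life - the same data keeps most of its value.
planning = data_value(100.0, age_days=7, half_life_days=365)   # ~98.7
```

The same data point scores very differently depending on which use case is asking, which is the point being made in the thread.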
@dorkninja Hilariously data is the thing that allows you to replace everything else. Or fix it. Or learn from it. Or ... #AllTheThings. Data is the truth/life/love. #BringMeData #AndBacon
@RalphFinos You have to know how to use it. Data itself is kind of useless unless you know what you're doing and can apply the right tools.
@plburris Agree Peter, regulatory compliance rules (GDPR et al) will start to rein this problem in I think. It will be interesting to see if it has an effect on the use of data in the Data Lake, as this is the usual place the data is just thrown.
@RalphFinos Agree. But there are also cases where data that's useless today, and tomorrow, and the next day - becomes invaluable 10 years from now. See: The CDC.
@dorkninja This is interesting, and 100% true. Data's value changes over time depending on the use case. Too few take this into consideration, let's hope that changes :)
@dorkninja I'd argue that's not data that's useless today or tomorrow. It's just a single data point that is useful in a trend, rather than on its own. Also something we can catalog, capture and covet.
@NeilRaden From an ops perspective - as an ex-Ops person myself, a very simple value model I use is (velocity / friction ) * the number of users that can put the data into action. Not academically rigorous but folks find it useful.
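The back-of-the-napkin value model above - (velocity / friction) * number of users who can act on the data - can be written down directly. This is only a sketch of that heuristic; the function name, scales, and example numbers are assumptions for illustration.

```python
# Minimal sketch of the ops value model: (velocity / friction) * actionable users.
# As the author says, not academically rigorous - units and scales are made up.

def data_value_score(velocity: float, friction: float, actionable_users: int) -> float:
    """Higher velocity and more actionable users raise value; friction lowers it."""
    if friction <= 0:
        raise ValueError("friction must be positive")
    return (velocity / friction) * actionable_users

# A fast, low-friction dashboard usable by 50 engineers...
print(data_value_score(velocity=10.0, friction=2.0, actionable_users=50))  # 250.0
# ...beats a slow, high-friction report only 3 people can act on.
print(data_value_score(velocity=2.0, friction=8.0, actionable_users=3))    # 0.75
```

The interesting property is the multiplier: halving friction or doubling the audience that can act on the data each double the score, which matches the thread's emphasis on searchability and wait time.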
@mcauth - There should be some conformity, or canonical models, in industry verticals. There may be different valuation models across departments. Unless there is a market for data, all data valuation models will be subjective.
@dorkninja Data often indicates a state, status, or condition of the infrastructure at a point in time, or it can represent a trend over time. For infrastructure, all of it is essential for historical analysis, real-time monitoring, and preventive maintenance.
Most applications today are vomiting data - state, status, condition in a vacuum. Not actionable - we need AI to synthesize it and deliver accelerated insight.
@Jshoc Largely true, especially given sheer volume. But it's also possible to - from a practice level - extract the stuff that you know matters and present it proactively, pre-AI.
Question #2: (1) Do you aggregate data from all your ITOps/DevOps, etc. tools into some data aggregation platform? (2) Is your data still stuck in some monitoring tool? (3) Are you beyond (1) and running algorithms against your data?