controlmbigdata

Automate Data-Driven Processes
Issues, trends, & opportunities in automating data management workloads
Ralph Finos
What platforms are you using to automate data workload management processes? http://www.via-cc.at...

Joe Goldberg
Spoke with an org that spends $150K per physical, bare-metal node in their cluster. In the cloud, the cost is 1% of that per node. Pretty clear where they will be going.
jameskobielus
A fair amount of what I'm seeing is that organizations are moving toward data management automation on cloud-based Hadoop and NoSQL platforms. Built for scale, speed, and high throughput.
Ralph Finos
How easy has it been to implement automated workload management processes that span your enterprise data architecture? http://www.via-cc.at...
Joe Goldberg
There are lots of ways to build it quick and dirty. Operating in the enterprise, reliably and at scale, not so much.
James (Jim) Perrone
Not easy at all. We have a national IT group and have to take what is offered. In turn, to get something new, it has to be 'sold' to the national level to provide it and then sold to the internal customers as well.
Joe Goldberg
As DevOps and dev teams in general become more tightly integrated into ongoing enterprise operations, they realize building only for speed will result in rework. You have to build for reliability/quality too.
Ralph Finos
http://www.via-cc.at...

Joe Goldberg
Everyone seems to be ingesting a ton of data so traditional collection and cleansing are huge.
Joe Goldberg
Recently heard from a bank that they are dumping everything into a data lake and are now trying to figure out how to resolve duplication.
Dave Vellante
@GoldbergJoe sounds like a mess waiting to happen!
James (Jim) Perrone
would be very useful in my place.
Joe Goldberg
It's super powerful to piggyback on existing automation.
Basil Faruqui
A lot of people embarked on building data lakes without understanding how the data would be used to solve business problems, resulting in data puddles.
Joe Goldberg
"Just" add another step to the mainframe table load to siphon off that same data into your data lake.
jameskobielus
From my Wikibon research, the core data mgt workloads that orgs are automating include data acquisition, movement, cleansing, storage, and analytics. In addition, more of the machine-learning dev/deploy pipeline is being automated.
Joe Goldberg
Folks consider data lakes to be "immutable" so the "Transform" part has to be fully automated to meet that goal.
jameskobielus
@GoldbergJoe Depending on the data type, deduplication can cut storage costs by as much as a factor of 20. That's for structured relational data.
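One way the deduplication discussed here might look in practice is content-based fingerprinting: hash each record's canonical form and keep only the first copy. The sketch below is a minimal illustration, not anything a participant described; the record fields, the `record_fingerprint` and `deduplicate` helpers, and the sample rows are all hypothetical.

```python
import hashlib
import json


def record_fingerprint(record: dict) -> str:
    """Return a stable hash of a record's content, independent of key order."""
    canonical = json.dumps(record, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def deduplicate(records):
    """Keep the first occurrence of each distinct record, preserving order."""
    seen = set()
    unique = []
    for record in records:
        fingerprint = record_fingerprint(record)
        if fingerprint not in seen:
            seen.add(fingerprint)
            unique.append(record)
    return unique


# Hypothetical rows landed in a lake from two overlapping feeds.
landed = [
    {"account": "A-100", "amount": 25.0},
    {"amount": 25.0, "account": "A-100"},  # same content, different key order
    {"account": "A-200", "amount": 10.0},
]

unique = deduplicate(landed)
# Ratio of rows landed to rows kept; a rough proxy for storage savings.
ratio = len(landed) / len(unique)
print(f"kept {len(unique)} of {len(landed)} rows, reduction factor {ratio:.2f}")
```

Exact-match hashing only catches byte-for-byte duplicates after canonicalization. Near-duplicates (differing timestamps, formatting, or spellings) would need matching rules specific to the data.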
jameskobielus
@BFaruqui Data lakes should be the core resource for data-driven app developers and data scientists to do their exploration, modeling, and model training.
Joe Goldberg
@dvellante It is INDEED a mess HAPPENING right now. Without some discipline and focus on automation, the mess becomes systemic and eventually lethal to projects.
jameskobielus
@GoldbergJoe Data lakes are immutable. But the sources from which they draw data (e.g., OLTP systems) may not be, nor are the applications to which they deliver data/results (e.g., CRM systems).
James (Jim) Perrone
How does one 'sell' the idea when it is new to the company and get past the hesitation of "It's never been done here before"?
Basil Faruqui
What I have seen working well is identifying a particular use case that people in the organization agree could be improved by better analytics. For example, a common use case in financial services is fraud detection.
jameskobielus
Automation can be a hard sell if you position it as potentially eliminating positions. It's best to focus on the benefits of improving speed, quality, and consistency of key business processes.
jameskobielus
Another approach, when trying to sell the notion of automating data-driven business processes, is to highlight how it makes knowledge workers more productive and reduces their need to slog through low-level data looking for insights.
James (Jim) Perrone
Thanks Guys....I will take note as the topics come up for sure....
Peter Burris
Draw a distinction between the technology and the discipline: The tech might never have been done before, but analytics typically have. For example, Bayesian analysis has been in use for decades. Show returns on analytics first.
Robby Dick
And it's modern automation, not the "same old" automation they think they solved for 20 years ago! Quite a few things have changed and the automation platform needed behind all these megatrends also needs to be a modern solution!
Basil Faruqui
speed and quality are both benefits of automation so it is important to position it in the context of how automation will help deliver a big data project with speed and also scale!
James (Jim) Perrone
It's tough for sure, especially where I am, given the nature of our work. All good ideas...
jameskobielus
@robbydbmc Right. Automation scripts can be a reusable asset in your application architecture. Old-fashioned automation was bespoke, tactical, and limited to a particular implementation.
Ralph Finos
How do you measure the business return from data-driven applications? http://www.via-cc.at...

Joe Goldberg
Navistar is a truck manufacturer that has reduced vehicle repair wait time by 40% from its use of telematics streamed from their trucks.
Peter Burris
At a high level: Customer experience first; automation and cost abatement second.
Joe Goldberg
They defined an initial target for improvement when they started the project and then far exceeded expectations.
Alon Lebenthal
providing better services to the customers (and providing these faster)
Robby Dick
That has allowed Navistar to monetize a new offering, essentially data and the interrogation of it!
Joe Goldberg
The difference between successful and failed Big Data projects was always the up-front BUSINESS use case.
James (Jim) Perrone
I recall their presentation at Engage last time.... Very informative...
jameskobielus
Measuring the business return from data-driven applications can be qualitative and/or quantitative. Example of the former is smarter business decisions. Examples of the latter include reduced customer churn and boosted response rates.
Basil Faruqui
The key to being successful in measuring success is making sure you have a problem that you can baseline, and then measuring how applying insights from big data has improved the process.
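The baseline-then-measure idea can be sketched in a few lines. This is only an illustration under stated assumptions: the wait-time figures and the 25% target are made up (chosen so the result matches the 40% reduction mentioned earlier in the chat), and `percent_improvement` is a hypothetical helper for a lower-is-better metric.

```python
def percent_improvement(baseline: float, current: float) -> float:
    """Percent reduction of a lower-is-better metric relative to its baseline."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return 100.0 * (baseline - current) / baseline


# Hypothetical figures: average repair wait time before and after the project.
baseline_wait_hours = 10.0
current_wait_hours = 6.0
target_percent = 25.0  # hypothetical improvement target set at project start

improvement = percent_improvement(baseline_wait_hours, current_wait_hours)
met_target = improvement >= target_percent
print(f"improvement: {improvement:.1f}% (target {target_percent:.1f}%, met: {met_target})")
```

Recording the baseline before the project starts is the important part; without it, the improvement figure cannot be computed afterwards.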
jameskobielus
@GoldbergJoe And that, no doubt, has improved their customers' measurable ability to ship on time, since trucks are available more of the time.
Neil Raden
@GoldbergJoe Navistar has done an excellent job at this; I've spent some time with them lately. However, they were really updating work they'd done with Qualcomm analog tech.
jameskobielus
@BFaruqui Right. This measurement can be drawn from anecdote, but, ideally, it should be grounded in quantitative metrics that are inline to your application architecture.
James (Jim) Perrone
The words Big Data are just starting to be spoken in my role. I am here to learn so I can be able to offer ControlM to help in that matter....
jameskobielus
Great. It's important not to overstate the "big" part of the equation. An automation architecture should operate at any data scale, enabling development of repeatable data pipeline artifacts (integration scripts and machine learning models).
Ralph Finos
http://www.via-cc.at...

Basil Faruqui
Workload management and automation are often left as an afterthought. Companies tend to address it when they are close to production, and that is where they realize that a lot of rework must be done to automate at scale.
Alon Lebenthal
Many processes do span the enterprise and integration is key, but then you need to have the right tools to work across the enterprise.
Robby Dick
From the discussions I am in, it has not been easy, which is too bad as I think there are solutions that can make it "easy", or at least "easier" as compared to what most are expecting!!
jameskobielus
@BFaruqui Right. Once IT has set up the essential data management pipeline, it grows clearer how much of that is repeatable and is amenable to automation going forward.
Robby Dick
Exactly - Automation isn't thought of as a platform that can help with these modern issues as I think many view automation platforms as something they addressed many years ago that cannot help in a modern data architecture.
jameskobielus
It's easier for data management workload automation to span the enterprise if most apps rely on a core set of enterprise data platforms. That requires consolidation. Perhaps standardization on public clouds.
jameskobielus
@robbydbmc Actually, I'm seeing more data managers who recognize the fact that big data and cloud data are simply unmanageable without automation. Not enough data mgt FTEs to go around (within budget) to do it all.
Ralph Finos
What are the principal platforms in your big-data analytics architecture? http://www.via-cc.at...

Joe Goldberg
Hadoop is very common with streaming and "SQL" engines on top of HDFS.
Joe Goldberg
Spark seems to have completely displaced MapReduce.
Joe Goldberg
In the SQL-on-Hadoop space, there's a ton of diversity with no dominant technology appearing yet.
Alon Lebenthal
I have been hearing from many customers of extensive usage of Spark
Joe Goldberg
Also looks like everyone is looking at cloud and cloud-based services like EMR, BigQuery, HDInsight, etc.
Basil Faruqui
Cloud is becoming the enabler for big data solutions to be delivered with speed and scale. Streaming technologies like Kafka are completely reshaping the traditional ETL model
jameskobielus
@GoldbergJoe Spark has displaced Hadoop's MapReduce in the front-end data access, query, and modeling for machine learning apps. Hadoop's HDFS, however, remains solidly entrenched for multistructured data acquisition, storage, & preprocessing.
Basil Faruqui
As companies are looking at cloud they are looking to leverage cloud storage systems instead of HDFS. Malwarebytes recently moved their entire storage to S3 and Hadoop is now used as a data processing engine
jameskobielus
@AlonLebenthal Soon, you should also be hearing of customers using deep-learning tools, such as TensorFlow, in conjunction with Spark. Key data developer focus, from my research.