BigData

Systems of Intelligence
Conversation with Wikibon #bigdata research analyst on Systems of Intelligence & its impact

Which statement best reflects the future of big data?

#bigdata @theCUBE Talks Big Data
Conversation with George Gilbert, Wikibon analyst, about big data, Hadoop & other cool news & trends
Dave Vellante
Q6. George...What does it mean for Spark and Hadoop to co-exist...sounds nice but what's the customer imperative there?
Rodrigo Gazzaneo
the Hadoop adoption curve has not peaked yet. Still room for growth. Use case will define the best tool.
Kirk Borne
Spark is about fast processing. Hadoop is about distributed data (files) access for processing. They co-exist.
George Gilbert
There is a school of thought that if you go all in on #ApacheSpark, you don't even need #Hadoop core: #HDFS and #YARN. But the storage layer is key for hand-offs between jobs when Spark isn't the end-to-end processing engine
Jen(Cohen)Cheplick
You want to use the right "tool" for the job, depending on data type, batch vs. real-time, etc.
Jen(Cohen)Cheplick
This was an interesting article I read a few weeks back. High level, but makes the point of co-existence. http://www.forbes.co...
George Gilbert
@vGazza Hadoop is becoming a key part of enterprise infrastructure. Some part of it, at least #HDFS and #YARN, should be a common foundation for a while. #ApacheSpark will have its own take on storage integration at some time
Rodrigo Gazzaneo
@ggilbert41 what other object stores can #ApacheSpark use as persistence layer other than #HDFS?
George Gilbert
@ggilbert41 to some, going all in on Spark means Databricks, which has its own stack all the way down to the metal
Kirk Borne
Spark can be MUCH faster than Hadoop processing, so Spark is red hot right now. That's fine, but Hadoop is not going away anytime soon.
Jen(Cohen)Cheplick
@ggilbert41 George & all - where are you seeing Spark take off? What use cases and/or industries?
Kirk Borne
@ggilbert41 Great comment about the Databricks implementation, which you get on @MapR. I would buy that for just that reason.
George Gilbert
@jscheplick there are a growing number of data-tool ISVs that are using it as their data management back-end - #Zoomdata is a great example. Also, Internet services vendors who have the skills are working with ML and streaming
David Floyer
Start with a truck (Hadoop), move to a pick-up (Spark), end with a max plaid Tesla (streams)
David Wild
Spark's in-memory efficiency great for complex network analysis across heterogeneous data sources
John Furrier
Q5: What about the impact of #Spark on this mega trend of Systems of Intelligence? The pre-chat poll was: 1) it will turbocharge Hadoop; 2) disrupt Hadoop; 3) neutral; 4) no impact?
George Gilbert
#ApacheSpark makes a lot of analytics easier and faster by running different workloads on the same engine. Still needs more performance improvements. #IBM contributions could really change things
David Floyer
Horses for courses; Hadoop is batch and most efficient/greatest throughput. Spark is microbatch, gets an answer quicker, but less efficient/lower throughput.
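A rough sketch of the batch vs. micro-batch trade-off described above, in plain Python rather than Hadoop or Spark APIs. The function names and the batch size are hypothetical; the point is only that micro-batches produce intermediate answers sooner, at the price of more per-batch invocations:

```python
def batch_total(events):
    """Batch: one pass over everything, one answer at the end."""
    return sum(events)


def micro_batch_totals(events, batch_size):
    """Micro-batch: emit a running total after each small slice."""
    running = 0
    for start in range(0, len(events), batch_size):
        running += sum(events[start:start + batch_size])
        yield running


events = [5, 1, 4, 2, 8, 3]

# Available only once all events have been processed.
final_answer = batch_total(events)

# One intermediate answer per slice of two events.
partial_answers = list(micro_batch_totals(events, 2))
```

The last micro-batch total matches the batch result; the difference is how early the first usable answer appears and how many times the engine has to be invoked to get there.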
Dave Vellante
@dfloyer Is Hadoop the new tape :-)
David Wild
Means you can develop for scale but still work on tab files
George Gilbert
@dfloyer Project Tungsten for #ApacheSpark should get them to pure streaming but that will never completely replace need for batch and #Hadoop
Rodrigo Gazzaneo
@dvellante HDFS is the new long-term storage medium that MapReduce and Spark can read from
Kirk Borne
Spark (fast, in-memory) vs Hadoop (batch) = the TWIN TOWERS of the Lord of the Things (#IoT): my new @MapR blog will discuss this
Jen(Cohen)Cheplick
@ggilbert41 Completely agree. There is a role for both
David Floyer
Infostreams is real-time, can be supported in development by Hadoop and Spark
Rodrigo Gazzaneo
@KirkDBorne loved the Twin Towers analogy! #LOTR
George Gilbert
@KirkDBorne the debate about running all analytics on fast, in-memory (i.e. Spark) is likely misleading
George Gilbert
@KirkDBorne the analysis around throughput / volume is likely to have different logic than analysis around per event updates
Rodrigo Gazzaneo
@ggilbert41 Memory x Flash x Disk is a matter of cost per capacity and potential revenue from insight
Kevin Petrie
@ggilbert41 Great SoI deck. Can you elaborate on why Spark is at "slow" end of innovation axis on Slide 24? Would think it is high innovation level
Jeff Frick
@dvellante > Hadoop = Tape - Love the analogy. What is "spinning rust?" in this model?
George Gilbert
@KevinPetrieTech good question: it was a tough call - but having the wild west ecosystem of databases or even the Hadoop ecosystem means each component can evolve independently at own pace. Spark libraries must evolve to integrate with each other
Kevin Petrie
@ggilbert41 Got it. Spark has arisen quickly, but nature of Spark libraries throttles future innovation vs. other platforms
Sugandh Mehta
Somewhere I heard the analogy of DW (bottled water) versus DL (natural stream)
Kirk Borne
See my article on the data lake, sea, ocean, flood, tsunami, stream,... https://www.mapr.com...
John Furrier
Q4: Besides big and semi-structured data, just how different are Data Lakes and Data Warehouses?
George Gilbert
#datalake is repository for unrefined, uncurated data that data scientists and biz analysts can explore. repeatable analytics can go to DW or a production #Hadoop cluster
Rodrigo Gazzaneo
it's about the schema. Data Lakes are flexible, Data Warehouses are rigid.
Jen(Cohen)Cheplick
Data lakes should be easier to adapt to changing business & infrastructure needs vs. EDW
George Gilbert
@vGazza also correct. the flexible schema is part of making the #datalake a self-service environment - you add the schema as you explore
Kirk Borne
Diverse multi-source heterogeneous data sets are the norm in Data Lakes, but not in Data Warehousing.
Rodrigo Gazzaneo
Data Lakes support schemas on demand, so you can improve the models continuously
Crowd Captain
Data warehouses and data lakes are slow and sound so old... what's new & different with Systems of Intelligence models?
Kirk Borne
Data Warehouse = schema on write. Data Lake = schema on read.
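A minimal sketch of the schema-on-write vs. schema-on-read distinction, in plain Python with no particular warehouse or lake product assumed. The record fields (`user`, `amount`, `device`), the `WAREHOUSE_SCHEMA` mapping, and the helper names are all hypothetical, chosen only for illustration:

```python
import json

# Hypothetical fixed schema, enforced at load time (schema on write).
WAREHOUSE_SCHEMA = {"user": str, "amount": float}


def write_to_warehouse(table, record):
    """Reject any record whose fields do not match the schema exactly."""
    if set(record) != set(WAREHOUSE_SCHEMA):
        raise ValueError(f"schema mismatch: {sorted(record)}")
    table.append({k: WAREHOUSE_SCHEMA[k](record[k]) for k in WAREHOUSE_SCHEMA})


def write_to_lake(lake, raw_line):
    """Store the raw text untouched; no schema is applied on ingestion."""
    lake.append(raw_line)


def read_from_lake(lake, fields):
    """Apply a schema only at query time (schema on read)."""
    for raw_line in lake:
        record = json.loads(raw_line)
        yield {name: record.get(name) for name in fields}


raw_events = [
    '{"user": "ann", "amount": 12.5}',
    '{"user": "bob", "amount": 3.0, "device": "phone"}',  # extra field
]

warehouse, lake, rejected = [], [], []
for line in raw_events:
    write_to_lake(lake, line)
    try:
        write_to_warehouse(warehouse, json.loads(line))
    except ValueError:
        rejected.append(line)

# The lake can answer a question about "device" that the warehouse
# schema never anticipated.
devices = list(read_from_lake(lake, ["user", "device"]))
```

In this sketch the lake keeps both events while the warehouse rejects the one with the unexpected field, which is the flexibility vs. structure trade-off the thread keeps returning to.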
Rodrigo Gazzaneo
@CrowdCaptain Data Lakes support flexible ingestion and insight layers, so they can be fast also
Kirk Borne
Data Lake allows easy updates (data "columns"). DW requires new schema and index builds when adding new
Jen(Cohen)Cheplick
Data lakes retain all (more) data vs. DW
John Furrier
ok it's all about the #dataocean bc oceans have currents and are always highly dynamic so the real time intelligence algos and tech are in the #dataoceans
Jasdeep Singh
DW are archived hierarchically... Data Lakes are mostly object-based
George Gilbert
@vGazza flexible ingestion enables self-service, but repeatability invites structure and therefore performance
Jen(Cohen)Cheplick
We are talking a lot about technical differences, but what is the difference for business value?
Kirk Borne
@CrowdCaptain #SystemsOfIntelligence should derive their value from their apps (smart Machine Learning), not their data model (DW or Data Lake)
David Floyer
data lakes are just a cheaper and bigger version of the failed data warehouse model
George Gilbert
@KirkDBorne it's all about flexibility vs. performance trade-off
Rodrigo Gazzaneo
a Data Lake can be a source of data for a Data Warehouse once you know what to ask
George Gilbert
@KirkDBorne Kirk, i couldn't have said it better. Data Lakes are training wheels. #MachineLearning is what drives Systems of Intelligence
Rodrigo Gazzaneo
@ggilbert41 systematic queries can be optimised and run on Data Warehouses for performance
David Floyer
data streams, data rivers power things along. Data lakes fester.
George Gilbert
@vGazza that's why DataLakes coexist with DataWarehouses - exploration vs. production performance
David Wild
Completely new methods needed for Data Lakes. #MachineLearning and stats miss many of the possibilities
Jeff Frick
@CrowdCaptain > Getting out front in the decision process.