[LIVE CHAT] Systems of Intelligence

Dave Vellante26

Q6. George... What does it mean for Spark and Hadoop to co-exist? Sounds nice, but what's the customer imperative there?

3 Votes

Rodrigo Gazzaneo

The Hadoop adoption curve has not peaked yet; there is still room for growth. The use case will define the best tool.

1 Vote

Kirk Borne

Spark is about fast processing. Hadoop is about distributed data (files) access for processing. They co-exist.

2 Votes

George Gilbert

There is a school of thought that if you go all in on #ApacheSpark, you don't even need the #Hadoop core: #HDFS and #YARN. But the storage layer is key for hand-offs between jobs when Spark isn't the end-to-end processing engine.

4 Votes
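
The hand-off point above can be sketched in a few lines. This is only an illustration under stated assumptions: a temporary local directory stands in for a shared storage layer such as HDFS, and the job names, file name, and record fields are hypothetical. The two functions share nothing except the storage path, which is why the storage layer matters when different engines run different stages.

```python
import json
import tempfile
from pathlib import Path

HANDOFF_FILE = "cleaned.jsonl"  # hypothetical file name agreed on by both jobs


def ingest_job(records, shared_dir):
    """First engine: drop incomplete records, write the rest to shared storage."""
    out_path = Path(shared_dir) / HANDOFF_FILE
    with out_path.open("w") as fh:
        for record in records:
            if record.get("value") is not None:
                fh.write(json.dumps(record) + "\n")
    return out_path


def report_job(shared_dir):
    """Second engine: knows only the storage location, not the first job."""
    in_path = Path(shared_dir) / HANDOFF_FILE
    with in_path.open() as fh:
        return sum(json.loads(line)["value"] for line in fh)


# A temporary directory plays the role of the shared storage layer.
with tempfile.TemporaryDirectory() as shared:
    ingest_job([{"value": 2}, {"value": None}, {"value": 5}], shared)
    total = report_job(shared)

print(total)  # 7
```

If one engine handled both stages end to end, the intermediate file would be unnecessary, which is the "all in on Spark" argument.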

Jen(Cohen)Cheplick

You want to use the right "tool" for the job, depending on data type, batch vs. real-time, etc.

3 Votes

Jen(Cohen)Cheplick

This was an interesting article I read a few weeks back. High level, but makes the point of co-existence. http://www.forbes.co...

4 Votes

George Gilbert

@vGazza Hadoop is becoming a key part of enterprise infrastructure. Some part of it, at least #HDFS and #YARN, should be a common foundation for a while. #ApacheSpark will have its own take on storage integration at some point.

1 Vote

Rodrigo Gazzaneo

@ggilbert41 what object stores can #ApacheSpark use as a persistence layer besides #HDFS?

0 Votes

George Gilbert

@ggilbert41 to some, going all in on Spark means Databricks, which has its own stack all the way down to the metal

2 Votes

Kirk Borne

Spark can be MUCH faster than Hadoop processing, so Spark is red hot right now. That's fine, but Hadoop is not going away anytime soon.

2 Votes

Jen(Cohen)Cheplick

@ggilbert41 George & all - where are you seeing Spark take off? What use cases and/or industries?

1 Vote

Kirk Borne

@ggilbert41 Great comment about the Databricks implementation, which you get on @MapR. I would buy that for just that reason.

0 Votes

Dave Vellante

@jscheplick in the press!

0 Votes

George Gilbert

@jscheplick there is a growing number of data-tool ISVs using it as their data management back-end - #Zoomdata is a great example. Also, Internet services vendors who have the skills are working with ML and streaming

1 Vote

David Floyer

Start with a truck (Hadoop), move to a pick-up (Spark), end with a max plaid Tesla (streams)

2 Votes

David Wild

Spark's in-memory efficiency is great for complex network analysis across heterogeneous data sources

0 Votes

John Furrier27

Q5: What about the impact of #Spark on this mega trend of Systems of Intelligence? The pre-chat poll was: 1) it will turbocharge Hadoop; 2) disrupt Hadoop; 3) neutral; 4) no impact?

2 Votes

George Gilbert

#ApacheSpark makes a lot of analytics easier and faster by running different workloads on the same engine. Still needs more performance improvements. #IBM contributions could really change things

2 Votes

David Floyer

Horses for courses; Hadoop is batch and has the greatest efficiency and throughput. Spark is microbatch: it gets an answer quicker, but with lower efficiency and throughput.

2 Votes
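
The batch versus microbatch contrast above can be illustrated with a toy sketch. It is plain Python, not Spark or Hadoop code, and the function names and the batch size are hypothetical. It does not measure throughput; it only shows the shape of the trade-off: a batch job returns one answer after seeing everything, while a micro-batch loop emits a usable partial answer after each small slice at the cost of more scheduling steps.

```python
from itertools import islice


def batch_total(events):
    """Batch: one pass over the full data set, one answer at the end."""
    return sum(events)


def micro_batch_totals(events, batch_size):
    """Micro-batch: yield an updated running total after each small slice."""
    iterator = iter(events)
    running = 0
    while True:
        chunk = list(islice(iterator, batch_size))
        if not chunk:
            break
        running += sum(chunk)
        yield running


events = list(range(1, 11))  # the numbers 1 through 10

print(batch_total(events))                  # 55
print(list(micro_batch_totals(events, 4)))  # [10, 36, 55]
```

Both paths arrive at the same final total. The micro-batch path reports 10 and 36 along the way, but it runs three steps instead of one, and that per-step overhead is where the efficiency argument comes from.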

Dave Vellante

@dfloyer Is Hadoop the new tape :-)

3 Votes

David Wild

Means you can develop for scale but still work on tab files

1 Vote

George Gilbert

@dfloyer Project Tungsten for #ApacheSpark should get them to pure streaming but that will never completely replace need for batch and #Hadoop

1 Vote

Rodrigo Gazzaneo

@dvellante HDFS is the new long-term storage medium that MapReduce and Spark can read from

1 Vote

Kirk Borne

Spark (fast, in-memory) vs Hadoop (batch) = the TWIN TOWERS of the Lord of the Things (#IoT): my new @MapR blog will discuss this

3 Votes

Jen(Cohen)Cheplick

@ggilbert41 Completely agree. There is a role for both

1 Vote

David Floyer

Infostreams is real-time, can be supported in development by Hadoop and Spark

0 Votes

Rodrigo Gazzaneo

@KirkDBorne loved the Twin Towers analogy! #LOTR

1 Vote

George Gilbert

@KirkDBorne the debate about running all analytics on fast, in-memory (i.e. Spark) is likely misleading

1 Vote

George Gilbert

@KirkDBorne analysis of throughput / volume is likely to follow different logic than analysis of per-event updates

1 Vote

Rodrigo Gazzaneo

@ggilbert41 Memory x Flash x Disk is a matter of cost per capacity and potential revenue from insight

1 Vote

Kevin Petrie

@ggilbert41 Great SoI deck. Can you elaborate on why Spark is at the "slow" end of the innovation axis on Slide 24? I would think it is at a high innovation level

2 Votes

Jeff Frick

@dvellante > Hadoop = Tape - Love the analogy. What is "spinning rust?" in this model?

2 Votes

George Gilbert

@KevinPetrieTech good question: it was a tough call - but having the wild west ecosystem of databases, or even the Hadoop ecosystem, means each component can evolve independently at its own pace. Spark libraries must evolve together to stay integrated with each other

2 Votes

Kevin Petrie

@ggilbert41 Got it. Spark has arisen quickly, but the nature of Spark libraries throttles future innovation vs. other platforms

1 Vote

Sugandh Mehta4

Somewhere I heard the analogy of DW (bottled water) versus DL (natural stream)

3 Votes

John Furrier

what about the #dataocean ?

0 Votes

Kirk Borne

See my article on the data lake, sea, ocean, flood, tsunami, stream,... https://www.mapr.com...

1 Vote

John Furrier60

Q4: Besides big and semi-structured data, just how different are Data Lakes and Data Warehouses?

3 Votes

George Gilbert

#datalake is a repository for unrefined, uncurated data that data scientists and biz analysts can explore. Repeatable analytics can go to the DW or a production #Hadoop cluster

3 Votes

Rodrigo Gazzaneo

It's about the schema. Data Lakes are flexible, Data Warehouses are rigid.

4 Votes

Jen(Cohen)Cheplick

Data lakes should be easier to adapt to changing business & infrastructure needs vs. EDW

1 Vote

George Gilbert

@vGazza also correct. the flexible schema is part of making the #datalake a self-service environment - you add the schema as you explore

3 Votes

Kirk Borne

Diverse multi-source heterogeneous data sets are the norm in Data Lakes, but not in Data Warehousing.

2 Votes

Rodrigo Gazzaneo

Data Lakes support schemas on demand, so you can improve the models continuously

1 Vote

Crowd Captain

Data warehouses and data lakes are slow and sound so old... what's new & different with Systems of Intelligence models?

3 Votes

Kirk Borne

Data Warehouse = schema on write. Data Lake = schema on read.

4 Votes
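
The schema-on-write versus schema-on-read distinction above can be sketched in plain Python. This is a minimal illustration, not how any particular warehouse or lake product works: the record fields, the schemas, and both function names are hypothetical. The warehouse-style loader enforces a fixed schema when data arrives and rejects anything that does not match; the lake-style reader keeps the raw lines as they are and applies whatever schema the current question needs.

```python
import json

# Hypothetical raw events; the second one carries a field that appeared later.
RAW_EVENTS = [
    '{"user": "a", "amount": "12.50"}',
    '{"user": "b", "amount": "3.00", "channel": "web"}',
]

# Warehouse-style schema, fixed before any data is loaded.
WAREHOUSE_SCHEMA = {"user": str, "amount": float}


def load_schema_on_write(lines, schema):
    """Validate and coerce at load time; records that do not fit are rejected."""
    table, rejected = [], []
    for line in lines:
        record = json.loads(line)
        if set(record) != set(schema):
            rejected.append(record)
            continue
        table.append({name: cast(record[name]) for name, cast in schema.items()})
    return table, rejected


def read_schema_on_read(lines, schema):
    """Leave raw lines untouched; project them through a schema at query time."""
    rows = []
    for line in lines:
        record = json.loads(line)
        rows.append({
            name: (cast(record[name]) if name in record else None)
            for name, cast in schema.items()
        })
    return rows


table, rejected = load_schema_on_write(RAW_EVENTS, WAREHOUSE_SCHEMA)
print(table)     # [{'user': 'a', 'amount': 12.5}]
print(rejected)  # [{'user': 'b', 'amount': '3.00', 'channel': 'web'}]

# A later question needs the new field: only the read-time schema changes.
print(read_schema_on_read(RAW_EVENTS, {"user": str, "channel": str}))
# [{'user': 'a', 'channel': None}, {'user': 'b', 'channel': 'web'}]
```

In the write-time path, accepting the `channel` field means changing the schema and reloading. In the read-time path, nothing stored changes; only the schema passed to the query does.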

Rodrigo Gazzaneo

@CrowdCaptain Data Lakes support flexible ingestion and insight layers, so they can be fast also

3 Votes

Kirk Borne

Data Lake allows easy updates (data "columns"). DW requires new schema and index builds when adding new columns.

1 Vote

Jen(Cohen)Cheplick

Data lakes retain all (more) data vs. DW

2 Votes

John Furrier

OK, it's all about the #dataocean, because oceans have currents and are always highly dynamic, so the real-time intelligence algos and tech are in the #dataoceans

2 Votes