[LIVE CHAT] Spark Summit SF

Crowd Doc

How much overlap in use-cases between Spark and Impala? cc/ @cloudera

3 Votes Vote

Dr.Cos

Actually, not much. Impala is more from Shark's realm, where Shark performs much better any way you look at it

1 Votes Vote

Crowd Doc

Yeah, I meant to ask Spark + Shark >= Impala ?

1 Votes Vote

Dr.Cos Without starting a flame war, I think the fact that Cloudera now supports Spark is very telling. Also, you might want to check http://www.eweek.com/cloud/hadoop-drives-down-costs-drives-up-usability-with-sql-convergence-5/

4 Votes Vote

Crowd Captain Did @cloudera "Jump the Shark" ??

1 Votes Vote

Vaibhav

Impala, Stinger, Presto, HAWQ and Shark are in the same 'bucket' if you will. Spark + Shark have a distinct advantage, esp. due to the BDAS stack

1 Votes Vote

John Furrier

@cloudera aren't dummies they see the future.. i don't understand why they don't talk about it more..their silence is deafening

1 Votes Vote

Dr.Cos There was a community voting process for the submissions. I think there were 38 submitted talks altogether. Can't tell more really...

0 Votes Vote

John Furrier

question for @vnivargi I was talking to Sharmila at @clearstorydata and she is very bullish on Spark .. obviously they are analytics for the BI consumer ..why are you guys working with Spark & give us a taste of the results

5 Votes Vote

Vaibhav

Spark has been working very well for us. The expressive power of Scala and the flexibility of RDDs work well for the interactive workloads our customers are seeing

5 Votes Vote

Crowd Captain is the speed advantage really a big deal

0 Votes Vote

John Furrier I like resilient distributed datasets but is Scala the requirement for programming - what about python?

0 Votes Vote

Vaibhav

Our backend stack is implemented in Scala, so there is a natural fit. I've spoken to folks who use the Python and Java bindings as well

1 Votes Vote

Vaibhav

Low latency is very important for our workloads, so is the fault tolerance and lineage of RDDs

1 Votes Vote

Dr.Cos

It is a new generation of data analytics platform. MR is batch and slow. New applications need an interactive system to churn models quickly. That's why Spark is gaining popularity so quickly. Commercial companies are backing it up: first @WANdisco, now others

8 Votes Vote

Dean of Big Data

What is Spark's relationship with YARN?

0 Votes Vote

Will Davis Support for running Spark on YARN was added to Spark in version 0.6.0, and improved in 0.7.0 & 0.8.0 - http://bit.ly/1bSvsTE

2 Votes Vote

theCUBE There's no relation between Spark and YARN. The latter is a Hadoop resource scheduler. Spark supports YARN and can work as a YARN application though. But the same way it works with AMPlab Mesos https://www.crowdchat.net/post/4903

0 Votes Vote
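
As a rough illustration of the point that YARN and Mesos are interchangeable hosts for Spark: in recent Spark releases the cluster manager is selected by the master URL handed to the application, so the same job can move between them. The helper below is a hypothetical sketch, not part of Spark. `master_url` and the host names are made up, the ports are the usual defaults, and the plain `yarn` master string applies to Spark 2.0 and later (the 0.6 to 0.8 releases discussed here launched on YARN differently).

```python
def master_url(manager: str, host: str = "localhost") -> str:
    """Return the Spark master URL for a given cluster manager.

    `manager` is a hypothetical label used only in this sketch.
    """
    if manager == "local":
        return "local[*]"                # run in-process on all local cores
    if manager == "standalone":
        return f"spark://{host}:7077"    # Spark's own standalone master
    if manager == "mesos":
        return f"mesos://{host}:5050"    # Mesos master, default port
    if manager == "yarn":
        return "yarn"                    # cluster located via the Hadoop config
    raise ValueError(f"unknown cluster manager: {manager}")
```

The resulting string is what would be passed as `--master` to `spark-submit` or to `SparkConf().setMaster(...)`; the application code itself stays the same.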

Crowd Captain

is the use case of Spark only limited to social data or only Graph DBs and Machine Learning environments?

0 Votes Vote

Stephanie McReynolds We're using Spark across a wide variety of use cases. Social analysis is there but more so supply chain distribution, localized market demand, and a host of others. Anytime exploratory analysis is key. @CrowdCaptain

2 Votes Vote

John Furrier

Why is the distinction between the use-cases for "realtime analytics" and real-time query serving so important?

4 Votes Vote

Dr.Cos

Actually, "real-time" has very specific SLAs. There's really no such thing as real-time analytics. Even HBase isn't real-time. However, in-memory systems are highly advantageous because of their performance. The importance lies in the speed too.

1 Votes Vote

Jeff Frick Always enjoy the "real time" discussion and definition. At what point is "Once per unit time" good enough when unit time is greater than 0? Usually find a unit that provides value, far north of 0. #RealValue

0 Votes Vote

theCUBE HBase isn't real time and some are moving to #AWS now; #kinesis and #redshift offer compelling closed loop data for real time.. not saying they are real time but they have a great queuing stack

1 Votes Vote

Stephanie McReynolds Agreed that "real-time" is an overused term. Even algo traders argue about what is really "real-time". Better to look at right time for the use case. But everyone wants speed...

2 Votes Vote

Stephanie McReynolds

You have to query before you analyze. So having access to both in one system is key. Querying in real-time is often easier than analyzing in real-time @furrier

1 Votes Vote

Scott Howser

Stephanie's point below is right on. I would add that the term "realtime" has many different meanings and expectations depending on the use case, industry, application, etc. In my experience a completely over-utilized term. Focus on SLA per application!

2 Votes Vote

Dr.Cos

Another great advantage of Spark is that it isn't really Hadoop-specific and is Hadoop-agnostic (in terms of versions). It supports the pure open source Hadoop implementation as well as offerings such as CDH

8 Votes Vote

Crowd Doc

Looking forward to someone offering "Spark as Service". As a startup our "resource scheduler" demands, we don't spend much time patching and updating all these separate components ;)

0 Votes Vote

Jeff Kelly

that's important, so as not to limit applicability in the enterprise - the #hadoop battle is still being waged

0 Votes Vote

Dr.Cos Battle is winding down, but there's still a few vendors that aren't really compatible with each other, unfortunately. Perhaps, Spark can be that sort of join point for them :)

1 Votes Vote

John Furrier

being Hadoop agnostic is HUGE for this effort - kudos to @cloudera for supporting this new direction

0 Votes Vote

John Furrier @theCUBE https://twitter.com/c0sin/status/405388210336845824

c0sin

@furrier @cloudera @WANdisco - the pioneer of commercial support for Spark is supporting the summit, BTW ;)

0 Votes Vote

Vaibhav

Spark also enables working with the rest of the BDAS stack, enabling Graphx, MLBase and Spark Streaming very easily

1 Votes Vote

Jeff Kelly

What is the state of the Spark community/ecosystem? Who are the main vendors supporting the community? Databricks, WANdisco, Cloudera ... who else? Intel? Yahoo?

3 Votes Vote

Dr.Cos

Community-wise, Spark became an ASF incubator project not that long ago and has already done its first incubation release! The community is vibrant and, roughly 3 years into development, is surpassing that of Hadoop at the same point

1 Votes Vote

John Furrier Spark in my opinion is the next big wave in big data bc it advances the mission of MR and extends the market - rising tide baby!! floating the business & tech value boats!! #bigdata #hadoop

1 Votes Vote

Vaibhav

More info here: https://cwiki.apache.org/confluence/display/SPARK/Committers and here: https://cwiki.apache.org/confluence/display/SPARK/Powered+By+Spark

3 Votes Vote

Jeff Frick Sharing the Knowledge - Thanks

0 Votes Vote

John Furrier

AWS is a sponsor so I suspect they will be integrating Spark in their cloud stack.. no mention of Pivotal/VMware/EMC ..

0 Votes Vote

Jeff Kelly AWS will no doubt make Spark available if that's what its customers want

0 Votes Vote

Dr.Cos

The set of topics covered at the Summit is very illustrative of the state of the community: http://spark-summit.org/agenda/

1 Votes Vote

John Furrier

What happens when a node crashes in Spark? Is the data replicated over the network or is it persisted in memory? re: data collections - thoughts from #techathletes here?

3 Votes Vote

Scott Howser

What about intermediate results in complex queries? Are they persisted or mem based?

1 Votes Vote

Vaibhav

RDDs in Spark are fault tolerant, so if a node crashes the same 'operations' can be reapplied to re-derive the RDD. This extends to Tachyon as well

2 Votes Vote

John Furrier that is what I was looking for thx for the data there :-)

0 Votes Vote

Dr.Cos Yup, still the master node isn't fault-safe

0 Votes Vote
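
The idea of re-deriving a lost RDD by reapplying the same operations can be sketched in plain Python. This is a toy model, not Spark's implementation: `ToyRDD`, `crash`, and the sample data are all hypothetical names invented for illustration. Each dataset remembers its parent and the operation that produced it (its lineage), so a lost in-memory copy is rebuilt from the durable source instead of being restored from a replica.

```python
from typing import Callable, List, Optional


class ToyRDD:
    """Toy stand-in for an RDD: it remembers how it was derived."""

    def __init__(self,
                 source: Optional[List[int]] = None,
                 parent: Optional["ToyRDD"] = None,
                 op: Optional[Callable[[List[int]], List[int]]] = None):
        self.source = source   # durable input, e.g. a file in HDFS
        self.parent = parent   # lineage: the dataset this one came from
        self.op = op           # lineage: the operation applied to the parent
        self.cache: Optional[List[int]] = None  # in-memory copy, lost on crash

    def map(self, f: Callable[[int], int]) -> "ToyRDD":
        return ToyRDD(parent=self, op=lambda rows: [f(x) for x in rows])

    def filter(self, pred: Callable[[int], bool]) -> "ToyRDD":
        return ToyRDD(parent=self, op=lambda rows: [x for x in rows if pred(x)])

    def collect(self) -> List[int]:
        # Recompute from lineage whenever the in-memory copy is missing.
        if self.cache is None:
            if self.parent is None:
                self.cache = list(self.source or [])
            else:
                self.cache = self.op(self.parent.collect())
        return self.cache

    def crash(self) -> None:
        """Simulate losing the node that held this dataset in memory."""
        self.cache = None


base = ToyRDD(source=[1, 2, 3, 4])
scaled = base.map(lambda x: x * 10)
big = scaled.filter(lambda x: x > 15)
first = big.collect()
scaled.crash()
big.crash()
second = big.collect()  # rebuilt by replaying map and filter
```

Note that this only models worker-side data loss; as the reply above points out, the master node is a separate concern.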

Dr.Cos

Spark has a notion of fault tolerance at the RDD level, by design. However, high-availability solutions would be very beneficial for the Spark master node and Spark context. That's where systems like the ones @WANdisco is offering will be crucial.

2 Votes Vote

Dr.Cos And BTW: Spark would of course benefit highly from high or continuous availability of HDFS (selfish plug: go @WANdisco)

0 Votes Vote

Dr.Cos

There's no relation between Spark and YARN. The latter is a Hadoop resource scheduler. Spark supports YARN and can work as a YARN application though. But the same way it works with AMPlab Mesos

4 Votes Vote

Crowd Doc

What are the advantages/disadvantages of deploying Spark with Hadoop YARN ? Which one is the first class citizen as a resource manager for Spark ? YARN or Mesos ? cc/ @spark_summit

0 Votes Vote

Dr.Cos YARN suffers from higher scheduling latency compared to Mesos, which is fine for most Hadoop applications but is critical for Spark and the like.

1 Votes Vote
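
A back-of-the-envelope way to see why scheduling latency matters more for Spark-style workloads than for batch jobs: the same fixed wait is a tiny share of a long batch job but dominates a sub-second interactive task. The numbers below are hypothetical and chosen only for illustration; they are not measurements of YARN or Mesos.

```python
def scheduling_overhead(sched_latency_s: float, task_runtime_s: float) -> float:
    """Fraction of wall-clock time spent waiting on the scheduler."""
    return sched_latency_s / (sched_latency_s + task_runtime_s)


# Hypothetical figures: a 1 s scheduling delay in both cases.
batch = scheduling_overhead(1.0, 600.0)       # 10-minute batch job
interactive = scheduling_overhead(1.0, 0.2)   # 200 ms interactive task
```

Under these assumed figures the batch job loses well under 1% of its time to scheduling, while the interactive task spends most of its time waiting.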

John Furrier

What are the advantages/disadvantages of deploying Spark with Hadoop YARN? What are the advantages/disadvantages of deploying Spark with Mesos? Which has better cluster mgt?

0 Votes Vote

Dr.Cos

On-Mesos deployment is very beneficial as it provides much faster scheduling than YARN. Also, with Mesos you can make Hadoop and Spark clusters coexist on the same infra.

1 Votes Vote

John Furrier

Why is Spark so important, and why is it gaining traction so successfully with developers in #bigdata?

2 Votes Vote

Stephanie McReynolds

For #bigdata analytics to have business impact, performance at the speed of thought is key. Spark delivers on fast, iterative analysis. @furrier

3 Votes Vote

Jeff Frick "Performance at the speed of thought" good one.

0 Votes Vote

Jeff Kelly

I see Spark as part of the larger evolution of #Hadoop from batch to multi-applications, in this case real-time analytics

1 Votes Vote

theCUBE

developers want a unified platform to abstract away complexities emerging for innovation on top of MapReduce bc, as @slangenfeld pointed out, multipass, ad hoc queries, and interactivity are now table stakes; speed and ease of programming are key

0 Votes Vote

Subash D'Souza

How do real-time databases such as HBase fit into the Spark ecosystem? If they don't, is there any work being done on migrating or creating something similar?

4 Votes Vote

Will Davis

Spark can create distributed datasets from any file stored in the Hadoop distributed file system (HDFS) or other storage systems supported by Hadoop (including your local file system, Amazon S3, Hypertable, HBase, etc) - http://bit.ly/1fGp74L

1 Votes Vote
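
For file-based sources, the storage system is picked by the URI scheme passed to `SparkContext.textFile`, which is a real PySpark call. The wrapper below is only a sketch: `count_lines` is a hypothetical helper, and `sc` is assumed to be an already created `pyspark.SparkContext`. HBase and similar stores go through Hadoop input formats instead of `textFile`, which is not shown here.

```python
from typing import Dict, Iterable


def count_lines(sc, uris: Iterable[str]) -> Dict[str, int]:
    """Count the lines behind each URI.

    `sc` is assumed to be a pyspark SparkContext; textFile() builds a
    distributed dataset of lines and count() triggers the actual read.
    """
    return {uri: sc.textFile(uri).count() for uri in uris}
```

With a live context this might be called as `count_lines(sc, ["file:///tmp/example.txt", "hdfs://namenode/data/example.txt"])`, where both paths are made-up placeholders.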

theCUBE

not sure people view hbase as real time

0 Votes Vote

Subash D'Souza i agree and it depends on how one would define real time. IMO anything with a quick response time is a valid case for it. :-)

0 Votes Vote
