sparksummit

Spark Summit SF
@theCube conversation with thought leaders about In-memory & Spark we are pregaming Spark Summit.
   9 years ago
#SparkSummit Spark Insight SF #SparkInsight @theCube conversation with thought leaders about In-memory & Spark during Spark Summit.
Crowd Doc
How much overlap in use-cases between Spark and Impala? cc/ @cloudera
Dr.Cos
Actually not much. Impala is more from Shark's realm, where Shark performs much better any way you look at it
Crowd Doc
Yeah, I meant to ask Spark + Shark >= Impala ?
Dr.Cos Without starting a flame war, I think the fact that Cloudera now supports Spark is very telling. Also, you might want to check http://www.eweek.com/cloud/hadoop-drives-down-costs-drives-up-usability-with-sql-convergence-5/
Crowd Captain Did @cloudera "Jump the Shark" ??
Vaibhav
Impala, Stinger, Presto, HAWQ and Shark are in the same 'bucket' if you will. Spark + Shark have a distinct advantage esp due to the BDAS stack
John Furrier
@cloudera aren't dummies they see the future.. i don't understand why they don't talk about it more..their silence is deafening
Dr.Cos There was a community voting process for the submissions. I think there were 38 submitted talks altogether. Can't tell more really...
John Furrier
question for @vnivargi I was talking to Sharmila at @clearstorydata and she is very bullish on Spark .. obviously they are analytics for the BI consumer ..why are you guys working with Spark & give us a taste of the results
Vaibhav
Spark has been working very well for us. The expressive power of Scala and the flexibility of RDDs works well for the interactive workloads our customers are seeing
Crowd Captain is the speed advantage really a big deal
John Furrier I like resilient distributed datasets but is Scala the requirement for programming - what about python?
Vaibhav
Our backend stack is implemented in Scala, so there is a natural fit. I've spoken to folks who use the Python and Java bindings as well
Vaibhav
Low latency is very important for our workloads, so is the fault tolerance and lineage of RDDs
Dr.Cos
It is a new generation of data analytics platform. MR is batch and slow. New applications need an interactive system to churn models quickly. That's why Spark is gaining popularity so quickly. Commercial companies are backing it: first @WANdisco, now others
Dean of Big Data
What is Spark's relationship with YARN?
Will Davis Support for running Spark on YARN was added to Spark in version 0.6.0, and improved in 0.7.0 & 0.8.0 - http://bit.ly/1bSvsTE
theCUBE There's no relation between Spark and YARN. The latter is a Hadoop resource scheduler. Spark supports YARN and can work as a YARN application though. But the same way it works with AMPlab Mesos https://www.crowdchat.net/post/4903
Crowd Captain
is the use case of Spark only limited to social data or only Graph DBs and Machine Learning environments?
Stephanie McReynolds We're using Spark across a wide variety of use cases. Social analysis is there but more so, supply chain distribution, localized market demand, and a host of others. Anytime exploratory analysis is key. @CrowdCaptain
John Furrier
Why is the distinction between the use-cases for "realtime analytics" and real-time query serving so important?
Dr.Cos
Actually "real-time" has very specific SLAs. There's really no such thing as real-time analytics. Even HBase isn't real-time. However, in-memory systems are highly advantageous because of their performance. The importance lies in the speed too.
Jeff Frick Always enjoy the "real time" discussion and definition. At what point is "Once per unit time" good enough when unit time is greater than 0? Usually find a unit that provides value, far north of 0. #RealValue
theCUBE HBase isn't real time and some are moving to #AWS now, where #kinesis and #redshift offer compelling closed loop data for real time.. not saying they are real time but they have a great queuing stack
Stephanie McReynolds Agreed that "real-time" is an overused term. Even algo traders argue about what is really "real-time". Better to look at right time for the use case. But everyone wants speed...
Stephanie McReynolds
You have to query before you analyze. So having access to both in one system is key. Querying in real-time is often easier than analyzing in real-time @furrier
Scott Howser
Stephanie's point below is right on. I would add that the term "realtime" has many different meanings and expectations depending on the use case, industry, application, etc. In my experience a completely over-utilized term. Focus on SLA per application!
Dr.Cos
Another great advantage of Spark is that it isn't really Hadoop-specific and is Hadoop-agnostic (in terms of versions). It supports the pure open source Hadoop implementation and such offerings as CDH
Crowd Doc
Looking forward to someone offering "Spark as a Service". As a startup, our "resource scheduler" demands we don't spend much time patching and updating all these separate components ;)
Jeff Kelly
that's important, so as not to limit applicability in the enterprise - the #hadoop battle is still being waged
Dr.Cos Battle is winding down, but there's still a few vendors that aren't really compatible with each other, unfortunately. Perhaps, Spark can be that sort of join point for them :)
John Furrier
being Hadoop agnostic is HUGE for this effort - kudos to @cloudera for supporting this new direction
John Furrier @theCUBE https://twitter.com/c0sin/status/405388210336845824
c0sin
@furrier @cloudera @WANdisco - the pioneer of commercial support for Spark is supporting the summit, BTW ;)
Vaibhav
Spark also enables working with the rest of the BDAS stack, enabling GraphX, MLbase and Spark Streaming very easily
Jeff Kelly
What is the state of the Spark community/ecosystem? Who are the main vendors supporting the community? Databricks, WANdisco, Cloudera ... who else? Intel? Yahoo?
Dr.Cos
Community-wise, Spark became an ASF incubator project not that long ago and has already done its first incubation release! The community is vibrant and, roughly 3 years into development, is surpassing that of Hadoop at the same point
John Furrier Spark in my opinion is the next big wave in big data bc it advances the mission of MR and extends the market - rising tide baby!! floating the business & tech value boats!! #bigdata #hadoop
John Furrier
AWS is a sponsor so I suspect they will be integrating Spark in their cloud stack.. no mention of Pivotal/VMware/EMC ..
Jeff Kelly AWS will no doubt make Spark available if that's what its customers want
Dr.Cos
The set of topics covered at the Summit is very illustrative of the state of the community: http://spark-summit.org/agenda/
John Furrier
What happens when a node crashes in Spark? Is the data replicated over the network or is it persisted in memory? re: data collections - thoughts from #techathletes here?
Scott Howser
What about intermediate results in complex queries? Are they persisted or mem based?
Vaibhav
RDDs in Spark are fault tolerant, so if a node crashes the same 'operations' can be reapplied to re-derive the RDD. Extends to Tachyon as well
John Furrier that is what I was looking for thx for the data there :-)
Dr.Cos Yup, still the master node isn't fault-safe
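Editor's sketch: the lineage idea Vaibhav describes above can be illustrated with a toy model in plain Python. This is not Spark's API or implementation; `ToyRDD`, `crash`, and the sample data are hypothetical names made up for illustration. Each dataset only remembers its parent and the operation that produced it, so a lost in-memory result can be rebuilt by reapplying that operation.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class ToyRDD:
    """Toy stand-in for an RDD (hypothetical, not Spark's class)."""
    source: Optional[List[int]] = None   # base data, set only on the root dataset
    parent: Optional["ToyRDD"] = None    # dataset this one was derived from
    op: Optional[Callable[[List[int]], List[int]]] = None  # operation applied to the parent
    cache: Optional[List[int]] = None    # in-memory result; lost on a simulated crash

    def map(self, f):
        # Record the transformation instead of running it right away.
        return ToyRDD(parent=self, op=lambda rows: [f(r) for r in rows])

    def filter(self, pred):
        return ToyRDD(parent=self, op=lambda rows: [r for r in rows if pred(r)])

    def collect(self):
        # Rebuild from lineage whenever the in-memory copy is missing.
        if self.cache is None:
            if self.parent is None:
                self.cache = list(self.source)  # re-read from "stable storage"
            else:
                self.cache = self.op(self.parent.collect())
        return self.cache

    def crash(self):
        # Simulate losing the in-memory data held by a failed node.
        self.cache = None


base = ToyRDD(source=[1, 2, 3, 4, 5, 6])
evens_squared = base.filter(lambda x: x % 2 == 0).map(lambda x: x * x)

first = evens_squared.collect()
evens_squared.crash()                 # the in-memory result is gone
recovered = evens_squared.collect()   # re-derived by reapplying the same operation
```

Nothing is replicated in this toy model; recovery costs recomputation instead. As Dr.Cos notes, this kind of recovery covers the data, not the master.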
Dr.Cos
Spark has a notion of fault tolerance at the RDD level, by design. However, high-availability solutions would be very beneficial for the Spark master node and Spark context. That's where systems like the ones @WANdisco is offering will be crucial.
Dr.Cos And BTW: Spark would of course benefit highly from high or continuous availability of HDFS (selfish plug: go @WANdisco)
Crowd Doc
What are the advantages/disadvantages of deploying Spark with Hadoop YARN ? Which one is the first class citizen as a resource manager for Spark ? YARN or Mesos ? cc/ @spark_summit
Dr.Cos YARN suffers from higher scheduling latency compared to Mesos. That is fine for most Hadoop applications, but is critical for Spark and the like.
John Furrier
What are the advantages/disadvantages of deploying Spark with Hadoop YARN? What are the advantages/disadvantages of deploying Spark with Mesos? Which has better cluster mgt?
Dr.Cos
On-Mesos deployment is very beneficial as it provides much faster scheduling than YARN. Also, with Mesos you can make Hadoop and Spark clusters coexist on the same infra.
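Editor's sketch: a rough way to see why scheduling latency matters more for short interactive tasks than for long batch jobs. The function name and the numbers below are illustrative assumptions, not measurements of YARN or Mesos.

```python
def scheduling_overhead(task_seconds: float, scheduling_latency_seconds: float) -> float:
    """Fraction of wall-clock time a task spends waiting to be scheduled."""
    total = task_seconds + scheduling_latency_seconds
    if total <= 0:
        raise ValueError("total time must be positive")
    return scheduling_latency_seconds / total


# Hypothetical numbers: the same 1-second scheduling delay in both cases.
batch_overhead = scheduling_overhead(task_seconds=600.0, scheduling_latency_seconds=1.0)
interactive_overhead = scheduling_overhead(task_seconds=0.5, scheduling_latency_seconds=1.0)
```

With these made-up inputs the delay is a fraction of a percent of the batch job but about two thirds of the interactive task's wall-clock time.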
John Furrier
Why is Spark so important, and why is it gaining traction so successfully with developers in #bigdata?
Stephanie McReynolds
For #bigdata analytics to have business impact, performance at the speed of thought is key. Spark delivers on fast, iterative analysis. @furrier
Jeff Frick "Performance at the speed of thought" good one.
Jeff Kelly
I see Spark as part of the larger evolution of #Hadoop from batch to multi-applications, in this case real-time analytics
theCUBE
developers want a unified platform to abstract away the complexities emerging for innovation on top of MapReduce bc, as @slangenfeld pointed out, multipass, ad hoc queries, and interactivity are now table stakes. Speed and ease of programming are key
Subash D'Souza
How do real-time databases such as HBase fit into the Spark ecosystem? If they don't, is there any work being done on migrating or creating something similar?
Will Davis
Spark can create distributed datasets from any file stored in the Hadoop distributed file system (HDFS) or other storage systems supported by Hadoop (including your local file system, Amazon S3, Hypertable, HBase, etc) - http://bit.ly/1fGp74L
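Editor's sketch: in the Python bindings, the file-based case Will describes is usually a call to `SparkContext.textFile` with a URI for the storage system. The helper name `line_counts` and the example paths are hypothetical. `sc` is assumed to be an existing `pyspark.SparkContext`, so the block itself does not start Spark. Reading from HBase goes through a Hadoop input format rather than `textFile` and is not shown here.

```python
def line_counts(sc, uris):
    """Count the lines behind each URI using sc.textFile.

    `sc` is assumed to be a pyspark.SparkContext (or anything with the same
    textFile method); `uris` is any iterable of path strings.
    """
    counts = {}
    for uri in uris:
        lines = sc.textFile(uri)     # one distributed dataset of text lines per URI
        counts[uri] = lines.count()  # count() is an action that triggers the read
    return counts


# Hypothetical locations; host names and paths are placeholders.
EXAMPLE_URIS = [
    "hdfs://namenode:8020/logs/events.txt",  # a file in HDFS
    "file:///tmp/events.txt",                # a file on the local file system
]
```

With a live context this would be called as `line_counts(sc, EXAMPLE_URIS)`.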
theCUBE
not sure people view hbase as real time
Subash D'Souza i agree and it depends on how one would define real time. IMO anything with a quick response time is a valid case for it. :-)