SparkInsight

More on 'Spark'
We are organizing a crowd chat to understand more about 'Spark' and how it can grow your business.
#sparkinsight
What is 'Spark'?
We are organizing a crowd chat to understand more about 'Spark' and how it can grow your business.
#SparkInsight
Spark: Building Smarter Apps
Join us to discuss building smarter apps fueled by "Spark". Share your views and use cases.
IBM Analytics
Q #1 : What is Spark?
Ira Michael Blonder
an Apache project focusing on server cluster architecture
Andrew C. Oliver
Spark is an in-memory distributed computing platform which essentially executes your functional Scala or Python code across the cluster as opposed to one machine. It also allows for micro-batching, which tastes like streaming, but less filling.
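To make the micro-batching idea above concrete, here is a minimal pure-Python sketch. It is not Spark code. It assumes events arrive as (timestamp, value) pairs and groups them into fixed time windows, each of which is then processed as one small batch. The names `micro_batches` and `events`, and the one-second interval, are made up for illustration.

```python
from collections import defaultdict


def micro_batches(events, interval):
    """Group (timestamp, value) events into consecutive windows of `interval` seconds."""
    windows = defaultdict(list)
    for timestamp, value in events:
        # Every event whose timestamp falls in the same window lands in the same batch.
        windows[int(timestamp // interval)].append(value)
    # Return the non-empty batches in time order.
    return [windows[key] for key in sorted(windows)]


# Hypothetical event stream: (seconds since start, measured value).
events = [(0.1, 3), (0.4, 5), (1.2, 7), (2.5, 1), (2.9, 4)]

# Each batch is processed as a unit, here by summing its values.
batch_sums = [sum(batch) for batch in micro_batches(events, interval=1.0)]
print(batch_sums)
```

Results only appear once a window closes, so latency is bounded below by the batch interval, which is why the approach resembles streaming without being record-at-a-time processing.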
Ira Michael Blonder
the architecture is claimed to lend itself to machine learning applications and "big data"
Ali Khanafer
An API that abstracts MapReduce and makes distributed computation easy. More like transitioning from C/C++ -> Java
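The MapReduce pattern being abstracted can be sketched in plain Python with a word count. This illustrates the pattern only, not Spark's API: the sublists in `partitions` stand in for data spread across machines, and `map_partition` is a hypothetical helper.

```python
from collections import Counter
from functools import reduce

# Each inner list stands in for the slice of data held by one worker.
partitions = [
    ["spark makes big data simple"],
    ["big data needs big clusters"],
]


def map_partition(lines):
    """Map step: count the words within a single partition."""
    return Counter(word for line in lines for word in line.split())


# On a real cluster the map step would run on every worker in parallel.
partial_counts = [map_partition(part) for part in partitions]

# Reduce step: merge the per-partition counts into one result.
totals = reduce(lambda left, right: left + right, partial_counts, Counter())
print(totals.most_common(2))
```

A higher-level API hides the partitioning and merging, so the author writes only the per-record logic.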
John Furrier
. @IBMbigdata Spark is a very big part of extending the value of #hadoop. We had a big discussion on this during @theCUBE preproduction meeting with @wikibon research team
Ira Michael Blonder
I put big data in quotes to establish the original association of this term with MapReduce, etc.
jameskobielus
An advanced analytics tool for in-memory distributed computation of machine learning, streaming, and graph analytics.
John Furrier
we posted a summary of the #hadoopsummit love fest and what it means to the market http://siliconangle....
Joel Horwitz
@mikethebbop I'd say it's a lot more than that
jameskobielus
A hot new market in big data analytics that focuses on empowering data scientists with tools for a wide range of low-latency statistical analysis challenges.
John Furrier
. @IBMbigdata #hadoopsummit "Spark is not yet ready for prime time," said Wikibon's Gilbert. Rather say: "Spark is still going through the process of being hardened that any large-scale engine requires before mainstream adoption."
John Furrier
. @IBMbigdata #hadoopsummit The Hadoop ecosystem needs to deliver real-time agile applications to support more interactivity and engagement data
jameskobielus
An Apache project that leverages and builds on much of the core of Hadoop, especially HDFS, while adding new high-performance runtime engines that are optimized for real-time interactive statistical analysis and distributed computation.
Ashok Nellikar
Apache Spark is a powerful open source processing engine built around sophisticated analytics, speed, & ease of use
Joel Horwitz
@furrier that, to me, undersells its full potential.
Andrew C. Oliver
@furrier ..but Spark isn't and can't be the only tool in that toolbox.
jameskobielus
A community of startups, ISVs, and established solution providers (e.g., IBM) doing exciting new things to empower a new generation of data scientists.
Rahul Kumar
Spark is an in-memory distributed data processing engine; it is an alternative to the MapReduce framework. It has very rich data ingestion connectors and higher-order functions for solving big data problems.
jameskobielus
Great to see that Joel Horwitz is on the chat...had me worried there for a sec, Joel....chat away!
John Furrier
. @IBMbigdata #hadoopsummit Spark is amazing for in-memory and more importantly iterative computing. The key benefit it offers is caching intermediate data in-memory for better access times
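The benefit of caching intermediate data for iterative work can be sketched in plain Python. This is not Spark code: `load_and_parse` is a hypothetical stand-in for an expensive load step, and the loop is a toy gradient descent that estimates the mean of the data. The point is that the parsed data is built once and then reused on every pass.

```python
load_calls = 0


def load_and_parse(raw_lines):
    """Stand-in for an expensive step such as reading and parsing from disk."""
    global load_calls
    load_calls += 1
    return [float(item) for item in raw_lines]


raw = ["1.0", "2.0", "3.0", "6.0"]

# Materialize the parsed data once and keep it in memory.
cached = load_and_parse(raw)

# Iterative algorithm: every pass reads the cached data instead of reloading it.
estimate = 0.0
for _ in range(50):
    gradient = sum(estimate - x for x in cached) / len(cached)
    estimate -= 0.5 * gradient

print(round(estimate, 6), load_calls)
```

Without the cached copy, each of the 50 passes would repeat the load and parse step, which is the overhead that disk-based MapReduce jobs pay between iterations.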
Andrew C. Oliver
@furrier The snark in me wants to say: This is the tech industry, nothing is production ready until the month it goes obsolete, then after that it is too poorly maintained for production.
Ira Michael Blonder
@furrier The "hardening" pt is very important. I like the Wikibon opinion
IBM Analytics
Let us move on to Q #2 on top of your screen please.
jameskobielus
Another summit, happening next week in San Francisco, at which I expect to see many of the Hadoop developers and users we're seeing this week at Hadoop Summit in San Jose. Actually (no surprise), there's plenty of Spark content in sessions here.
John Furrier
#hadoopsummit Some use cases where Shark outperforms Hadoop: 1) Real Time querying of data: in secs rather than minutes w Shark; 2) Stream processing: Fraud detection, log processing in live streams alerts, aggregates, analysis; 3) Sensor data processing
IBM Analytics
Q #4 : How does Spark improve data scientist productivity?
jameskobielus
Spark improves data scientist productivity by enabling faster iterative development and refinement of statistical models fed by fresh continuous streams of low-latency data.
Joel Horwitz
more algos more fun = data innovation
jameskobielus
Spark improves data scientist productivity by enabling them to leverage their existing HDFS data in low-latency streaming and graph machine-learning projects for which MapReduce is not optimal.
Andrew C. Oliver
Presuming we agree on what a data scientist is (a great math guy, poor businessman with rudimentary but poor Python skills), you can execute code in your IPython Notebook, see it run, see your graph, wash, rinse, repeat, 0 deploy time.
Elesin Olalekan
Almost missed out on this crowd chat
John Furrier
. @IBMbigdata #hadoopsummit Traditional tools don't allow data scientists to impact the business. We see a huge opportunity to scale the work of data scientists. Access to data has to be as easy as using Lotus 1-2-3 in the early PC days
Ian Pointer
The Spark REPL is an excellent way for data scientists to prototype solutions without having to submit code to the cluster all the time, leading to better feedback and iterative development
Joel Horwitz
iterate faster to work difficult data into works of art
jameskobielus
Spark improves data scientist productivity by deepening the library of advanced analytic algorithms available to them and facilitating creative blends of streaming, graph, and machine learning.
John Furrier
. @IBMbigdata I got that last quote from @therealcojo yesterday in another crowdchat we did talking about how infrastructure #CLUSAnalytics impacts the knowledge worker
Joel Horwitz
Use all of your data with @apachespark. This is a sample-free zone.
Ali Khanafer
Provides me with powerful tools that are extremely time-efficient, which allows me to focus on tackling big and beautiful problems
jameskobielus
Spark accelerates data scientist productivity because, like any open-source community, it facilitates sharing of code, expertise, data, and other artifacts within and across teams in a way that is agnostic to the underlying hardware/OS platform.
Joel Horwitz
Ali Khanafer, you speak the truth!
Ira Michael Blonder
@ibmbigdata Whether Spark "improves" productivity depends IMO on whether the organization has adopted the underlying server architecture considerations.
Andrew C. Oliver
By giving everyone such whimsical errors, giving us nostalgia for "OLE Automation Error" :-D
jameskobielus
Spark, like any growth market, accelerates everybody's creativity by bringing fresh young blood into the industry doing cool new things with these tools that established developers may not have considered.
Ira Michael Blonder
If it has not adopted the Hadoop premise, then it is not likely Hive will be around, etc
Ira Michael Blonder
So the organization needs to shift towards "big data", fundamentally
Ali Khanafer
Thanks @JSHorwitz :-) I actually read "spark the truth" XD .. too much spark lately!
IBM Analytics
Please take a look at the next question on top of your screen on Spark as a big data analytics tool
IBM Analytics
Q #2 : What are the important components of a Spark implementation?
IBM Analytics
Please post your replies here.
Ira Michael Blonder
an app written in Java, Scala, or Python, and Hive compatibility
jameskobielus
The core of Spark: the runtime engines for core Spark, Spark Streaming, GraphX, and MLlib.
Ira Michael Blonder
Hive is a data warehouse application project from Apache
jameskobielus
Mr. Philip Russom of TDWI is standing here. Say something Phil: "businesses have to identify real-time requirements first or high-performance queries... then Spark's a natural."
Joel Horwitz
data prep and ml pipelines. The rest will take care of itself.
Andrew C. Oliver
a good rational use case, expertise in Python and/or Scala and expertise in managing a large cluster (or cloud deployment). A lot of memory doesn't hurt :-)
John Furrier
. @IBMbigdata #hadoopsummit Real-time querying of data and stream processing, and of course low-latency data
Ali Khanafer
clear separation of importing data (from SQL, HBase, etc) and distributed computation.
Ashok Nellikar
Spark can run in a standalone cluster mode or on Apache Mesos
jameskobielus
An underlying storage layer: often it's HDFS. The Spark codebase (the engines and libraries). Visualization and modeling tools. A high-performance distributed computing cluster.
Ira Michael Blonder
Question for Mr. Philip Russom: are businesses not presenting a lot of real time data requirements?
Joel Horwitz
@jameskobielus Spark is not real-time. Let's all agree on this for the sake of humanity :)
John Furrier
. @jameskobielus #hadoopsummit the building blocks will be commodity and the value is in the insight, so systems of intelligence and cognitive are huge value areas
Avadhoot
key components of the Spark platform are Spark SQL, Spark Streaming, MLlib, and GraphX
Ira Michael Blonder
It would seem retail businesses would have a constant need for these types of solutions?
jameskobielus
Data ingest and prep tools. Cluster management.
Andrew C. Oliver
Sadly at the moment usually 90% ETL and 10% everything else, despite what the sticker says.
jameskobielus
You could say that savvy data scientists are the most important Spark "component" of all.
IBM Analytics
Thanks all! It would be great if you could please look for Q #3 on top of your screen.
Andrew C. Oliver
@jameskobielus I'd say that the underlying business culture change is the biggest obstacle. ATM we make irrational emotional and political decisions then back them up with data. We need to instead make decisions based on data and balance the rest.
Andrew C. Oliver
@jameskobielus I actually don't believe in data scientists as they are described by the industry.