#SparkInsight - CrowdChat

More on 'Spark'

We are organizing a crowd chat to understand more about 'Spark' and how it can grow your business.

#sparkinsight @ibmbigdata @jameskobielus @louistcherian

IBM Analytics

Q #1 : What is Spark?

Ira Michael Blonder

an Apache project focusing on server cluster architecture

Andrew C. Oliver

Spark is an in-memory distributed computing platform which essentially executes your functional Scala or Python code across the cluster as opposed to one machine. It also allows for micro-batching, which tastes like streaming, but less filling.
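To make the micro-batching idea concrete, here is a minimal pure-Python sketch. It does not use Spark's API: the names `micro_batches` and `running_totals` are hypothetical, and batches are cut by element count for simplicity, whereas a streaming engine would typically cut them by time interval.

```python
from itertools import islice
from typing import Iterable, Iterator, List


def micro_batches(events: Iterable[int], batch_size: int) -> Iterator[List[int]]:
    """Group an incoming stream into small fixed-size batches (illustrative only)."""
    it = iter(events)
    while True:
        batch = list(islice(it, batch_size))
        if not batch:
            return
        yield batch


def running_totals(events: Iterable[int], batch_size: int) -> List[int]:
    """Treat each micro-batch as a tiny batch job and keep a running total."""
    totals = []
    total = 0
    for batch in micro_batches(events, batch_size):
        total += sum(batch)  # the "batch job" applied to this slice of the stream
        totals.append(total)
    return totals
```

The caller sees results arrive batch by batch, which feels like streaming even though each step is ordinary batch processing.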

Ira Michael Blonder

the architecture is claimed to lend itself to machine learning applications and "big data"

An API that abstracts MapReduce and makes distributed computation easy. More like transitioning from C/C++ -> Java
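As a rough illustration of what "abstracting MapReduce" can look like, here is a single-machine Python sketch of a word count written as a chain of higher-order functions. The helper `reduce_by_key` is a hypothetical stand-in written for this example, not a call into Spark or Hadoop, and nothing here is distributed.

```python
from collections import defaultdict
from functools import reduce
from typing import Callable, Dict, Iterable, Tuple


def reduce_by_key(
    pairs: Iterable[Tuple[str, int]], combine: Callable[[int, int], int]
) -> Dict[str, int]:
    """Group values by key, then fold each group with `combine` (illustrative only)."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return {key: reduce(combine, values) for key, values in grouped.items()}


def word_count(lines: Iterable[str]) -> Dict[str, int]:
    """Map each word to (word, 1), then reduce the pairs by key."""
    words = (word.lower() for line in lines for word in line.split())
    pairs = ((word, 1) for word in words)
    return reduce_by_key(pairs, lambda a, b: a + b)
```

The map and reduce phases are still there, but the caller only writes two short functions rather than separate mapper and reducer jobs.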

. @IBMbigdata Spark is a very big part of extending the value of #hadoop. We had a big discussion on this during the @theCUBE preproduction meeting with the @wikibon research team

Ira Michael Blonder

I put big data in quotes to establish orig association of this term with map reduce, etc

An advanced analytics tool for in-memory distributed computation of machine learning, streaming, and graph analytics.

we posted a summary of the #hadoopsummit love fest and what it means to the market http://siliconangle....

@mikethebbop I'd say it's a lot more than that

A hot new market in big data analytics that focuses on empowering data scientists with tools for a wide range of low-latency statistical analysis challenges.

. @IBMbigdata #hadoopsummit "Spark is not yet ready for prime time," said Wikibon's Gilbert. Rather say: "Spark is still going through the process of being hardened that any large scale engine requires before mainstream adoption."

. @IBMbigdata #hadoopsummit The Hadoop ecosystem needs to deliver real-time agile applications to support more interactivity and engagement data

An Apache project that leverages and builds on much of the core of Hadoop, especially HDFS, while adding new high-performance runtime engines that are optimized for real-time interactive statistical analysis and distributed computation.

Apache Spark is a powerful open source processing engine built around sophisticated analytics, speed, & ease of use

@furrier that to me, undersells its full potential.

Andrew C. Oliver

@furrier ..but Spark isn't and can't be the only tool in that toolbox.

Ira Michael Blonder

@JSHorwitz Thanks

A community of startups, ISVs, and established solution providers (e.g., IBM) doing exciting new things to empower a new generation of data scientists.

Spark is an in-memory distributed data processing engine; it is an alternative to the MapReduce framework. It has very rich data ingestion connectors and higher-order functions for solving big data problems.

Great to see that Joel Horwitz is on the chat...had me worried there for a sec, Joel....chat away!

. @IBMbigdata #hadoopsummit Spark is amazing for in-memory and more importantly iterative computing. The key benefit it offers is caching intermediate data in-memory for better access times
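The caching point can be sketched in plain Python. The `LazyDataset` class below is a hypothetical stand-in for a lazily computed dataset, not Spark's API; it only counts how often the source data is recomputed during an iterative job, with and without caching.

```python
from typing import Callable, List, Optional


class LazyDataset:
    """Hypothetical lazily computed dataset; recomputed on each use unless cached."""

    def __init__(self, compute: Callable[[], List[float]]):
        self._compute = compute
        self._cached: Optional[List[float]] = None
        self._use_cache = False
        self.compute_calls = 0  # how many times the source was recomputed

    def cache(self) -> "LazyDataset":
        """Ask for the computed data to be kept in memory after first use."""
        self._use_cache = True
        return self

    def collect(self) -> List[float]:
        if self._use_cache and self._cached is not None:
            return self._cached
        self.compute_calls += 1
        data = self._compute()
        if self._use_cache:
            self._cached = data
        return data


def iterate_mean(ds: LazyDataset, steps: int) -> float:
    """Toy iterative job: move an estimate halfway toward the data mean each pass."""
    estimate = 0.0
    for _ in range(steps):
        data = ds.collect()  # every pass touches the same intermediate data
        estimate += 0.5 * (sum(data) / len(data) - estimate)
    return estimate
```

With five passes, the uncached dataset is recomputed five times and the cached one only once, while both give the same estimate. That gap is the benefit being described for iterative workloads.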

Andrew C. Oliver

@furrier The snark in me wants to say: This is the tech industry, nothing is production ready until the month it goes obsolete, then after that it is too poorly maintained for production.

Ira Michael Blonder

@furrier The "hardening" pt is very important. I like the Wikibon opinion

Let us move on to Q #2 on top of your screen please.

Another summit, happening next week in San Francisco, at which I expect to see many of the Hadoop developers and users we're seeing this week at Hadoop Summit in San Jose. Actually (no surprise), there's plenty of Spark content in sessions here

#hadoopsummit Some use cases where Shark outperforms Hadoop: 1) Real Time querying of data: in secs rather than minutes w Shark; 2) Stream processing: Fraud detection, log processing in live streams alerts, aggregates, analysis; 3) Sensor data processing

IBM Analytics

Q #4 : How does Spark improve data scientist productivity?

Spark improves data scientist productivity by enabling faster iterative development and refinement of statistical models fed by fresh continuous streams of low-latency data.

more algos more fun = data innovation

Spark improves data scientist productivity by enabling them to leverage their existing HDFS data in low-latency streaming and graph machine-learning projects for which MapReduce is not optimal.

Andrew C. Oliver

presuming we agree on what a data scientist is (a great math guy, poor businessman with rudimentary but poor Python skills), you can execute code in your IPython Notebook, see it run, see your graph, wash, rinse, repeat, 0 deploy time.

Elesin Olalekan

Almost missed on this crowd chat

. @IBMbigdata #hadoopsummit Traditional tools don't allow data scientists to impact the business. We see a huge opportunity to scale the work of data scientists. Access to data has to be as easy as using Lotus 1-2-3 in the early PC days

The Spark REPL is an excellent way for data scientists to prototype solutions without having to submit code to the cluster all the time, leading to better feedback and iterative development

iterate faster to work difficult data into works of art

Spark improves data scientist productivity by deepening the library of advanced analytic algorithms available to them and facilitating creative blends of streaming, graph, and machine learning.

. @IBMbigdata I got that last quote from @therealcojo yesterday in another crowdchat we did talking about how infrastructure #CLUSAnalytics impacts the knowledge worker

use all of your data with @apachespark, this is a sample-free zone

Provides me with powerful tools that are extremely time-efficient, which allows me to focus on tackling big and beautiful problems

Spark accelerates data scientist productivity because, like any open-source community, it facilitates sharing of code, expertise, data, and other artifacts within and across teams in a way that is agnostic to the underlying hardware/OS platform.

ali khanafer you speak the truth!

Ira Michael Blonder

@ibmbigdata Whether spark "improves" productivity depends IMO on whether the organization has adopted the underlying server architecture considerations,

Andrew C. Oliver

By giving everyone such whimsical errors giving us nostalgia for "OLE Automation Error" :-D

Spark, like any growth market, accelerates everybody's creativity by bringing fresh young blood into the industry doing cool new things with these tools that established developers may not have considered.

Ira Michael Blonder

If it has not adopted the Hadoop premise, then it is not likely Hive will be around, etc

Ira Michael Blonder

So the organization needs to shift towards "big data", fundamentally

Thanks @JSHorwitz :-) I actually read "spark the truth" XD .. too much spark lately!

Please take a look at the next question on top of your screen on Spark as a big data analytics tool

IBM Analytics

Q #2 : What are the important components of a Spark implementation?

Please post your replies here.

Ira Michael Blonder

an app written in Java, Scala, or Python, and Hive compatibility

The core of Spark: the runtime engines for core Spark, Spark Streaming, GraphX, and MLlib.

Ira Michael Blonder

Hive is a data warehouse application project from Apache

Mr. Philip Russom of TDWI is standing here. Say something Phil: "Businesses have to identify real-time requirements first or high-performance queries... then Spark's a natural."

data prep and ml pipelines. The rest will take care of itself.

Andrew C. Oliver

a good rational use case, expertise in Python and/or Scala, and expertise in managing a large cluster (or cloud deployment). A lot of memory doesn't hurt :-)

. @IBMbigdata #hadoopsummit Real-time querying of data and stream processing of, of course, low-latency data

clear separation of importing data (from SQL, HBase, etc) and distributed computation.

Spark can run in a standalone cluster mode or on Apache Mesos

An underlying storage layer: often it's HDFS. The Spark codebase (the engines and libraries). Visualization and modeling tools. A high-performance distributed computing cluster.

Ira Michael Blonder

Question for Mr. Philip Russom: are businesses not presenting a lot of real time data requirements?

@jameskobielus Spark is not real-time. Let's all agree on this for the sake of humanity :)

. @jameskobielus #hadoopsummit the building blocks will be commodity and the value is in the insight, so systems of intelligence and cognitive are huge value areas

key components of the Spark platform are Spark SQL, Spark Streaming, MLlib, and GraphX

Ira Michael Blonder

It would seem retail businesses would have a constant need for these types of solutions?

Data ingest and prep tools. Cluster management.

Andrew C. Oliver

Sadly at the moment usually 90% ETL and 10% everything else, despite what the sticker says.

You could say that savvy data scientists are the most important Spark "component" of all.

Thanks all! It would be great if you could please look for Q #3 on top of your screen.

Andrew C. Oliver

@jameskobielus I'd say that the underlying business culture change is the biggest obstacle. ATM we make irrational emotional and political decisions then back them up with data. We need to instead make decisions based on data and balance the rest.

Andrew C. Oliver

@jameskobielus I actually don't believe in data scientists as they are described by the industry.