
IBM Analytics

Q4: What are some current hurdles with Spark machine learning, and how are they being addressed?

Alexander Lang
I'd like to see even more and richer feature transformations. This is 80% of the work in predictive modeling.

Jean-François Puget
Are there any hurdles? Really? Just kidding.

Nick Pentreath
In my view, the main hurdles are usability, scalability, and deployment. Spark ML pipelines are great but still need improvements in the areas of usability (APIs, etc.) and scalability (especially in terms of "wide" datasets with many features)...

Jean-François Puget
I guess the main hurdle is that there aren't enough built-in feature engineering transformations and algorithms....

Alexander Lang
As @JFPuget said: Spark is king when your problem is distributable. One of my co-workers is looking at an NMF problem right now where "locality is king". This makes using Spark trickier.

Mike Dusenberry
One of the big issues in Spark ML is deciding which machine learning algorithms to include. Adding more algorithms can be useful, but each one adds a large amount of code maintenance overhead to the project.

Jean-François Puget
... the fix is to add many more at each Spark release, including 2.0

Nick Pentreath
... a constant issue that comes up with Spark ML users is deploying pipelines to production. While Spark supports pipeline save/load, this does not solve the problem of pushing your trained pipeline to a real-time, low-latency serving environment.

Alexander Lang
No, @JFPuget and I didn't coordinate our answers beforehand ;-)

Jean-François Puget
We need a standard to store pipelines and models directly from Spark and other ML frameworks.

Alexander Lang
@JFPuget Like PMML "2.0" ?

Yiannis Gkoufas
Another hurdle in the big data frameworks (not specific to ML) is the challenge of efficiently testing the jobs the developer is designing before applying them to a huge dataset. You want tools that quickly identify potential problems in the job.

jameskobielus
@dusenberrymw I'd worry about Spark "code bloat" where you're adding yet another ML algorithm to an existing app. Here's a column I wrote a few years ago on this topic: http://www.ibmbigdat...

Jean-François Puget
@alexlang11 or PFA (Portable Format for Analytics), a JSON-based format
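[A minimal plain-Python sketch of the idea being discussed: exporting a trained pipeline's parameters to a portable JSON document so another runtime can score with it. The document schema here is invented for illustration and is not the actual PFA or PMML specification.]

```python
import json

# Hypothetical portable-model document, in the spirit of PFA/PMML:
# plain data (no code), so any serving runtime could reconstruct the
# scoring logic. Field names and stage parameters are illustrative.
model_doc = {
    "format": "example-portable-model",  # invented format name
    "version": "0.1",
    "pipeline": [
        {"stage": "StandardScaler", "mean": [1.5, 3.0], "std": [0.5, 1.0]},
        {"stage": "LogisticRegression",
         "coefficients": [0.8, -1.2], "intercept": 0.3},
    ],
}

# Round-trip through JSON, as a serving system would on load.
serialized = json.dumps(model_doc)
restored = json.loads(serialized)
print(restored["pipeline"][1]["coefficients"])  # → [0.8, -1.2]
```

The appeal of this approach is that the serving side needs only a JSON parser and a small interpreter for the stage types, rather than a full Spark runtime.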

Mike Dusenberry
@MLnick Yeah "wide" datasets seem to be a more general problem with Spark.

Alexander Lang
@johngouf Spot on! We face this with a GraphX app we're building right now...

Jean-François Puget
@dusenberrymw I'd like to see all scikit-learn algorithms...

Alexander Lang
@dusenberrymw And while you're at it, start with matrix operations ;-)

jameskobielus
@JFPuget In terms of standards for storing pipelines and models directly from Spark and other ML frameworks, Ben Lorica had a good article a few years ago here: https://www.oreilly....

Alexander Lang
@dusenberrymw Could you elaborate on the problems with "wide" datasets you've seen?

Yiannis Gkoufas
@alexlang11 Curious to have another open discussion about GraphX to hear your thoughts!

jameskobielus
Here's a column I wrote on standardized ML pipelines: http://www.ibmbigdat...

Mike Dusenberry
@alexlang11 Yeah I agree! Making it easier to write distributed ML algorithms based on matrix operations is the goal with the Apache SystemML project that sits on top of Spark.

Nick Pentreath
@alexlang11 I for one have seen issues with the OneHotEncoder on high-cardinality categorical features. Also, the fact that many transformers operate on only one column at a time is an issue.
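[A plain-Python illustration (not Spark code) of why one-hot encoding a high-cardinality categorical column produces "wide" data: every distinct category becomes one output dimension. The data here is made up for illustration.]

```python
def one_hot(value, categories):
    """Return a dense one-hot vector for `value` over `categories`."""
    return [1 if value == c else 0 for c in categories]

# 100k distinct category values, e.g. user IDs.
user_ids = [f"user_{i}" for i in range(100_000)]

vec = one_hot("user_42", user_ids)
print(len(vec))  # 100000 -- one dimension per distinct category
print(sum(vec))  # 1      -- only a single non-zero entry
```

Spark ML's OneHotEncoder emits sparse vectors rather than dense ones like this, but downstream transformers and per-feature bookkeeping can still pay a cost proportional to the number of dimensions.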

Alexander Lang
@johngouf GraphX works, but it's the "works well on the developer's laptop, runs into OOM for a very large dataset on the server" problem. We have to get better at the "debuggability" of that...

Nick Pentreath
@dusenberrymw I actually think that the right approach is for that NOT to live in Spark core - I think the SystemML idea of building matrix math on top of the core primitives and focusing on that is making more and more sense
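[A toy sketch of the idea Nick is pointing at: block-partitioned matrix multiplication as a primitive built on top of a distributed engine. This is plain single-machine Python for illustration; in a system like Apache SystemML on Spark, each block-level partial product would be an independent task computed in parallel and summed by block key.]

```python
def matmul_blocked(a, b, bs):
    """Multiply square matrices a and b by iterating over bs-sized blocks.

    Each (row-block, col-block, inner-block) triple is an independent
    chunk of work, which is what makes this formulation distributable.
    """
    n = len(a)
    c = [[0.0] * n for _ in range(n)]
    for ib in range(0, n, bs):
        for jb in range(0, n, bs):
            for kb in range(0, n, bs):
                # Partial product for one block triple, accumulated into c.
                for i in range(ib, min(ib + bs, n)):
                    for j in range(jb, min(jb + bs, n)):
                        s = 0.0
                        for k in range(kb, min(kb + bs, n)):
                            s += a[i][k] * b[k][j]
                        c[i][j] += s
    return c

a = [[1, 2], [3, 4]]
b = [[5, 6], [7, 8]]
print(matmul_blocked(a, b, bs=1))  # [[19.0, 22.0], [43.0, 50.0]]
```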

Mike Dusenberry
@alexlang11 The issue I've found with "wide" datasets is that the size of each row is much larger, and so the number of partitions that the data is split into needs to be much higher to prevent hitting Spark limits.

Mike Dusenberry
@dusenberrymw Additionally, I've found that having larger rows incurs a much higher cost to several DataFrame/DataSet operations.
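[A back-of-the-envelope sketch of the partitioning point above. Spark has historically capped a single partition/block at roughly 2 GB, so the minimum partition count scales with per-row size; the dataset shapes and 8-bytes-per-value assumption here are illustrative, not measurements.]

```python
MAX_PARTITION_BYTES = 2 * 1024 ** 3  # ~2 GB historical Spark block limit

def min_partitions(n_rows, n_features, bytes_per_value=8):
    """Smallest partition count keeping every partition under the limit."""
    total_bytes = n_rows * n_features * bytes_per_value
    return -(-total_bytes // MAX_PARTITION_BYTES)  # ceiling division

# A "tall" dataset: 1M rows x 100 features fits in one partition.
print(min_partitions(1_000_000, 100))       # 1
# The same row count but "wide" (100k features) needs hundreds.
print(min_partitions(1_000_000, 100_000))   # 373
```

This is why repartitioning to a much higher count is often the first fix when a wide dataset starts hitting size limits.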

Yiannis Gkoufas
@alexlang11 That's what I imagined as well.... I encountered the same issues with Spark SQL, but in version 1.6.x.

jameskobielus
@MLnick Speaking of scaling, here's a link to Nick's upcoming talk at Spark Summit Europe on scaling factorization machines on Spark using parameter servers: https://spark-summit...

jameskobielus
To learn how expert Spark developers are addressing these ML hurdles, attend the meetup in Brussels on 27 Oct: http://bit.ly/2ecmY7...