
IBM Analytics

Q4: What are some current hurdles with Spark machine learning, and how are they being addressed?

Alexander Lang
I'd like to see even more and richer feature transformations. This is 80% of the work in predictive modeling.

Jean-François Puget
Are there any hurdles? Really? Just kidding.

Nick Pentreath
In my view, the main hurdles are usability, scalability, and deployment. Spark ML pipelines are great but still need improvements in the areas of usability (APIs, etc.) and scalability (especially in terms of "wide" datasets with many features)...

Jean-François Puget
I guess the main hurdle is that there aren't enough built-in feature engineering transformations and algorithms....

Alexander Lang
As @JFPuget said: Spark is king when your problem is distributable. One of my co-workers is looking at an NMF problem right now where "locality is king". This makes using Spark trickier.

Mike Dusenberry
One of the big issues in Spark ML is deciding which machine learning algorithms to include. Adding more algorithms can be useful, but each one adds a large amount of code maintenance overhead to the project.

Jean-François Puget
... the fix is to add many more at each Spark release, including 2.0

Nick Pentreath
... a constant issue that comes up with Spark ML users is deploying pipelines to production. While Spark supports pipeline save/load, this does not solve the problem of pushing your trained pipeline to a real-time, low-latency serving environment.

Alexander Lang
No, @JFPuget and I didn't coordinate our answers beforehand ;-)

Jean-François Puget
We need a standard to store pipelines and models directly from Spark and other ML frameworks.

Alexander Lang
@JFPuget Like PMML "2.0" ?

Yiannis Gkoufas
Another hurdle in the big data frameworks (not specific to ML) is the challenge of efficiently testing the jobs the developer is designing before applying them to a huge dataset. You want tools that quickly identify potential problems in the job.

jameskobielus
@dusenberrymw I'd worry about Spark "code bloat" where you're adding yet another ML algorithm to an existing app. Here's a column I wrote a few years ago on this topic: http://www.ibmbigdat...

Jean-François Puget
@alexlang11 or PFA (Portable Format for Analytics), a JSON-based format
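[A minimal plain-Python sketch of the idea being discussed: exporting a trained pipeline's parameters to a portable JSON document so another runtime can score with it. The document schema here is invented for illustration and is not the actual PFA or PMML specification.]

```python
import json

# Hypothetical portable-model document, in the spirit of PFA/PMML:
# plain data (no code), so any serving runtime could reconstruct the
# scoring logic. Field names and stage parameters are illustrative.
model_doc = {
    "format": "example-portable-model",  # invented format name
    "version": "0.1",
    "pipeline": [
        {"stage": "StandardScaler", "mean": [1.5, 3.0], "std": [0.5, 1.0]},
        {"stage": "LogisticRegression",
         "coefficients": [0.8, -1.2], "intercept": 0.3},
    ],
}

# Round-trip through JSON, as a serving system would on load.
serialized = json.dumps(model_doc)
restored = json.loads(serialized)
print(restored["pipeline"][1]["coefficients"])  # → [0.8, -1.2]
```

The appeal of this approach is that the serving side needs only a JSON parser and a small interpreter for the stage types, rather than a full Spark runtime.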

Mike Dusenberry
@MLnick Yeah "wide" datasets seem to be a more general problem with Spark.

Alexander Lang
@johngouf Spot on! We face this with a GraphX app we're building right now...

Jean-François Puget
@dusenberrymw I'd like to see all scikit-learn algorithms...

Alexander Lang
@dusenberrymw And while you're at it, start with matrix operations ;-)

jameskobielus
@JFPuget In terms of standards for storing pipelines and models directly from Spark and other ML frameworks, Ben Lorica had a good article a few years ago here: https://www.oreilly....

Alexander Lang
@dusenberrymw Could you elaborate on the problems with "wide" datasets you've seen?

Yiannis Gkoufas
@alexlang11 Curious to have another open discussion about GraphX to hear your thoughts!

jameskobielus
Here's a column I wrote on standardized ML pipelines: http://www.ibmbigdat...

Mike Dusenberry
@alexlang11 Yeah I agree! Making it easier to write distributed ML algorithms based on matrix operations is the goal with the Apache SystemML project that sits on top of Spark.

Nick Pentreath
@alexlang11 I for one have seen issues with the OneHotEncoder on high-cardinality categorical features. Also, the fact that many transformers operate on only one column at a time is an issue.
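[A plain-Python illustration (not Spark code) of why one-hot encoding a high-cardinality categorical column produces "wide" data: every distinct category becomes one output dimension. The data here is made up for illustration.]

```python
def one_hot(value, categories):
    """Return a dense one-hot vector for `value` over `categories`."""
    return [1 if value == c else 0 for c in categories]

# 100k distinct category values, e.g. user IDs.
user_ids = [f"user_{i}" for i in range(100_000)]

vec = one_hot("user_42", user_ids)
print(len(vec))  # 100000 -- one dimension per distinct category
print(sum(vec))  # 1      -- only a single non-zero entry
```

Spark ML's OneHotEncoder emits sparse vectors rather than dense ones like this, but downstream transformers and per-feature bookkeeping can still pay a cost proportional to the number of dimensions.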

Alexander Lang
@johngouf GraphX works, but it's the "works well on the developer's laptop, runs into OOM for a very large dataset on the server" problem. We have to get better at the "debuggability" of that...

Nick Pentreath
@dusenberrymw I actually think that the right approach is for that NOT to live in Spark core - I think the SystemML idea of building matrix math on top of the core primitives and focusing on that is making more and more sense
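[A toy sketch of the idea Nick is pointing at: block-partitioned matrix multiplication as a primitive built on top of a distributed engine. This is plain single-machine Python for illustration; in a system like Apache SystemML on Spark, each block-level partial product would be an independent task computed in parallel and summed by block key.]

```python
def matmul_blocked(a, b, bs):
    """Multiply square matrices a and b by iterating over bs-sized blocks.

    Each (row-block, col-block, inner-block) triple is an independent
    chunk of work, which is what makes this formulation distributable.
    """
    n = len(a)
    c = [[0.0] * n for _ in range(n)]
    for ib in range(0, n, bs):
        for jb in range(0, n, bs):
            for kb in range(0, n, bs):
                # Partial product for one block triple, accumulated into c.
                for i in range(ib, min(ib + bs, n)):
                    for j in range(jb, min(jb + bs, n)):
                        s = 0.0
                        for k in range(kb, min(kb + bs, n)):
                            s += a[i][k] * b[k][j]
                        c[i][j] += s
    return c

a = [[1, 2], [3, 4]]
b = [[5, 6], [7, 8]]
print(matmul_blocked(a, b, bs=1))  # [[19.0, 22.0], [43.0, 50.0]]
```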

Mike Dusenberry
@alexlang11 The issue I've found with "wide" datasets is that the size of each row is much larger, and so the number of partitions that the data is split into needs to be much higher to prevent hitting Spark limits.

Mike Dusenberry
@dusenberrymw Additionally, I've found that having larger rows incurs a much higher cost to several DataFrame/DataSet operations.
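[A back-of-the-envelope sketch of the partitioning point above. Spark has historically capped a single partition/block at roughly 2 GB, so the minimum partition count scales with per-row size; the dataset shapes and 8-bytes-per-value assumption here are illustrative, not measurements.]

```python
MAX_PARTITION_BYTES = 2 * 1024 ** 3  # ~2 GB historical Spark block limit

def min_partitions(n_rows, n_features, bytes_per_value=8):
    """Smallest partition count keeping every partition under the limit."""
    total_bytes = n_rows * n_features * bytes_per_value
    return -(-total_bytes // MAX_PARTITION_BYTES)  # ceiling division

# A "tall" dataset: 1M rows x 100 features fits in one partition.
print(min_partitions(1_000_000, 100))       # 1
# The same row count but "wide" (100k features) needs hundreds.
print(min_partitions(1_000_000, 100_000))   # 373
```

This is why repartitioning to a much higher count is often the first fix when a wide dataset starts hitting size limits.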

Yiannis Gkoufas
@alexlang11 That's what I imagined as well.... I encountered the same issues with Spark SQL, but in version 1.6.x.

jameskobielus
@MLnick Speaking of scaling, here's a link to Nick's upcoming talk at Spark Summit Europe on scaling factorization machines on Spark using parameter servers: https://spark-summit...

jameskobielus
To learn how expert Spark developers are addressing these ML hurdles, attend the meetup in Brussels on 27 Oct: http://bit.ly/2ecmY7...