SparkMachineLearning

Machine Learning on Spark
Join our crowd chat about Spark’s powerful machine learning capabilities, and what’s new and coming.
IBM Analytics
Q4: What are some of the current hurdles with Spark machine learning, and how are they being addressed?
Alexander Lang
I'd like to see even more and richer feature transformations. This is 80% of the work in predictive modeling.
Jean-François Puget
Are there any hurdles? Really? Just kidding.
Nick Pentreath
In my view these are: usability, scalability and deployment. Spark ML pipelines are great but still need improvements in areas of usability (APIs etc) and scalability (especially in terms of "wide" datasets with many features)...
Jean-François Puget
I guess the main hurdle is that there aren't enough built-in feature engineering transforms and algorithms....
Alexander Lang
As @JFPuget said: Spark is king when your problem is distributable. One of my co-workers is looking at an NMF problem right now, where "locality is king". This makes using Spark trickier
Mike Dusenberry
One of the big issues in Spark ML is deciding which machine learning algorithms to include. Adding more algorithms can be useful, but each one adds a large amount of code maintenance overhead to the project.
Jean-François Puget
... the fix is to add many more at each Spark release, including 2.0
Nick Pentreath
... a constant issue coming up with Spark ML users is deploying pipelines to production. While Spark supports pipeline save/load, this does not solve the problem of pushing your trained pipeline to a real-time, low-latency serving environment
Alexander Lang
No, @JFPuget and I didn't coordinate our answers beforehand :-)
Jean-François Puget
We need a standard to store pipelines and models directly from Spark and other ML frameworks.
Yiannis Gkoufas
another hurdle in big data frameworks (not specific to ML) is the challenge of efficiently testing the jobs a developer is designing before applying them to a huge dataset. You want tools that quickly identify potential problems in the job
jameskobielus
@dusenberrymw I'd worry about Spark "code bloat" where you're adding yet another ML algorithm to an existing app. Here's a column I wrote a few years ago on this topic: http://www.ibmbigdat...
Jean-François Puget
@alexlang11 or PFA (Portable Format for Analytics), a JSON format
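For context on the JSON format mentioned here: a minimal PFA-style scoring document can be sketched in plain Python as below. The field names ("input", "output", "action") follow the published PFA spec, but this is an illustrative fragment only, not a document validated against a PFA engine.

```python
import json

# A minimal PFA-style scoring document: takes a double, adds 1.
# Sketch only -- field names follow the PFA spec, but this has not
# been validated by a PFA scoring engine.
pfa_doc = {
    "input": "double",
    "output": "double",
    "action": [{"+": ["input", 1]}],
}

serialized = json.dumps(pfa_doc, indent=2)
print(serialized)
```

The appeal of such a format is exactly what @JFPuget describes: the scoring logic is plain JSON, so a trained pipeline could in principle move between Spark and other frameworks without code.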
Mike Dusenberry
@MLnick Yeah "wide" datasets seem to be a more general problem with Spark.
Alexander Lang
@johngouf Spot on! We face this with a GraphX app we're building right now...
Jean-François Puget
@dusenberrymw I'd like to see all scikit-learn algorithms...
Alexander Lang
@dusenberrymw And while you're at it, start with matrix operations :-)
jameskobielus
@JFPuget In terms of standards for storing pipelines and models directly from Spark and other ML frameworks, Ben Lorica had a good article a few years ago here: https://www.oreilly....
Alexander Lang
@dusenberrymw Could you elaborate on the problems with "wide" datasets you've seen?
Yiannis Gkoufas
@alexlang11 Curious to have another open discussion about GraphX to hear your thoughts!
jameskobielus
Here's a column I wrote on standardized ML pipelines: http://www.ibmbigdat...
Mike Dusenberry
@alexlang11 Yeah I agree! Making it easier to write distributed ML algorithms based on matrix operations is the goal with the Apache SystemML project that sits on top of Spark.
Nick Pentreath
@alexlang11 I for one have seen issues with the OneHotEncoder on high-cardinality categorical features. Also the fact that many transformers only operate on one column is an issue.
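Nick's one-hot-encoding point can be illustrated with a back-of-envelope sketch in plain Python. The row and category counts below are hypothetical, and this is not Spark's actual OneHotEncoder; it just shows why cardinality drives the cost.

```python
# Sketch: why one-hot encoding a high-cardinality categorical hurts.
# With a million distinct categories, each row becomes a million-wide,
# mostly-zero vector. Sparse storage helps, but the downstream model
# still carries one weight per category. Numbers are hypothetical.
n_categories = 1_000_000
n_rows = 10_000_000

dense_bytes = n_categories * n_rows * 8   # one float64 per cell
sparse_bytes = n_rows * (4 + 8)           # ~one (index, value) pair per row

print(f"dense:  {dense_bytes / 1e12:.1f} TB")
print(f"sparse: {sparse_bytes / 1e6:.0f} MB")
```

Even with sparse vectors, the feature space itself stays a million wide, which is exactly the "wide dataset" scalability issue raised earlier in the chat.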
Alexander Lang
@johngouf GraphX works, but it's the "works well on the developer's laptop, runs into OOM for a very large dataset on the server" problem. We have to get better at "debuggability" of that...
Nick Pentreath
@dusenberrymw I actually think the right approach is for that NOT to live in Spark core - the SystemML idea of building matrix math on top of the core primitives, and focusing on that, makes more and more sense
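The "matrix math on top of core primitives" idea can be sketched in miniature: store a matrix as a dict of blocks keyed by (block_row, block_col) and multiply block-wise, which mirrors how a distributed engine works with partitions. This is a pure-Python toy under those assumptions, not SystemML's actual implementation.

```python
def mm(X, Y):
    """Multiply two small dense blocks (lists of lists)."""
    return [[sum(X[i][k] * Y[k][j] for k in range(len(Y)))
             for j in range(len(Y[0]))] for i in range(len(X))]

def madd(X, Y):
    """Element-wise sum of two blocks."""
    return [[a + b for a, b in zip(rx, ry)] for rx, ry in zip(X, Y)]

def block_matmul(A, B, nb):
    """C[i,j] = sum over k of A[i,k] @ B[k,j] on an nb x nb block grid.
    A distributed engine would compute each (i, j, k) product on a worker
    and reduce the partial sums per output block."""
    C = {}
    for i in range(nb):
        for j in range(nb):
            acc = None
            for k in range(nb):
                prod = mm(A[(i, k)], B[(k, j)])
                acc = prod if acc is None else madd(acc, prod)
            C[(i, j)] = acc
    return C

# Two 2x2 matrices, each stored as four 1x1 blocks
A = {(0, 0): [[1]], (0, 1): [[2]], (1, 0): [[3]], (1, 1): [[4]]}
B = {(0, 0): [[5]], (0, 1): [[6]], (1, 0): [[7]], (1, 1): [[8]]}
C = block_matmul(A, B, 2)
print(C)
```

The point of the design is that only the block-level primitives (multiply, add, shuffle by key) need to be distributed; every higher-level ML algorithm is then expressed against the matrix abstraction.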
Mike Dusenberry
@alexlang11 The issue I've found with "wide" datasets is that each row is much larger, so the number of partitions the data is split into needs to be much higher to avoid hitting Spark's partition size limits.
Mike Dusenberry
@dusenberrymw Additionally, I've found that having larger rows incurs a much higher cost to several DataFrame/DataSet operations.
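Mike's partition-count point can be made concrete with a quick estimate. Spark has historically capped a single partition/shuffle block at 2 GB, so for a fixed row count, wider rows force proportionally more partitions. The sizes and safety factor below are hypothetical back-of-envelope numbers.

```python
# Sketch: estimating a minimum partition count for "wide" rows,
# assuming Spark's historical 2 GB per-partition/block cap.
MAX_PARTITION_BYTES = 2 * 1024**3  # 2 GB limit

def min_partitions(n_rows, n_features, bytes_per_value=8, safety=4):
    """Partitions needed to keep each partition well under the cap.
    `safety` leaves headroom for serialization and shuffle overhead."""
    total = n_rows * n_features * bytes_per_value
    return max(1, -(-total * safety // MAX_PARTITION_BYTES))  # ceil div

print(min_partitions(1_000_000, 100))      # narrow data: a few partitions
print(min_partitions(1_000_000, 100_000))  # wide data: thousands needed
```

The same row count that fits comfortably in a handful of partitions when narrow needs orders of magnitude more partitions once the rows are wide, which is why wide datasets surface these limits first.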
Yiannis Gkoufas
@alexlang11 That's what I imagined as well.... I encountered the same issues with Spark SQL, but in version 1.6.x
jameskobielus
@MLnick Speaking of scaling, here's a link to Nick's upcoming talk at Spark Summit Europe on scaling factorization machines on Spark using parameter servers: https://spark-summit...
jameskobielus
To learn how expert Spark developers are addressing these ML hurdles, attend the meetup in Brussels on 27 Oct: http://bit.ly/2ecmY7...
IBM Analytics
Q5: What is machine learning’s optimal niche within diversified big data analytics ecosystems?
Alexander Lang
It's a pretty big "niche" to be sure...
jameskobielus
Machine learning's optimal niche is as the core approach for automating more of the distillation of patterns from unstructured and complex data, both at rest and streaming.
Alexander Lang
Full ack with @jameskobielus: By applying ML models, you can operationalize complex insights from data patterns: rules can't express that
jameskobielus
Machine learning's niche is to support development of apps that automatically "learn" from fresh data--in other words, it's the intelligence behind cognitive computing.
Nick Pentreath
it is a big "niche" - but essentially ML should be applied whenever (i) humans won't or can't scale to the problem (speed, data volume, complexity, etc.) and/or (ii) the ML model can perform as well as or better than the human.
Jean-François Puget
Machine Learning is all about making predictions on new data, and learning from wrong predictions.
jameskobielus
Machine learning's niche is to detect the data-driven environmental "signals" that drive artificial intelligence applications. Autonomous vehicles, for example, would be impossible without ML.
Jean-François Puget
Other analytics may stop once a model of input data is built.
Alexander Lang
Saw this today: http://idlewords.com...
"ML is like a deep-fat fryer. If you’ve never deep-fried something before, you think to yourself: 'This is amazing! I bet this would work on anything!'"
jameskobielus
Here's a recent blog I wrote on the "power apps of machine learning": https://www.linkedin...
Alexander Lang
@JFPuget "and learn from wrong predictions" is absolutely key! Otherwise, you get stuck in a negative reinforcement loop
Jean-François Puget
@alexlang11 I'd be careful. Often one combines rules and machine learning to take action.
Yiannis Gkoufas
Another untapped opportunity I see is supporting ML algorithms expressed with the same APIs across a diverse variety of underlying nodes/slaves, like Raspberry Pi, Nvidia boards, Android devices...
Mike Dusenberry
ML is also particularly nice when the problem has a large amount of variability that is difficult to encapsulate in a simple rule-based system.
Jean-François Puget
@MLnick ... and (iii) you have examples of what needs to be learned.
Alexander Lang
@JFPuget We actually do that as well! Let me officially retract my statement above
Nick Pentreath
@JFPuget yes true! Though anomaly detection and some unsupervised techniques can still be applied :)
Mike Dusenberry
@JFPuget Yeah, that's key in a more general sense. ML itself won't solve a problem without a clear definition; it's the application of ML to a well-defined problem that makes it useful.
Jean-François Puget
@MLnick Agreed, but these are not mainstream use of ML IMHO.
Jean-François Puget
@MLnick supervised learning is easier to leverage than unsupervised learning when it comes to making decisions.
Mike Dusenberry
@JFPuget I think it's also easier for non-ML users to spot supervised problems -- essentially I want to be able to "predict this" given some data.
Jean-François Puget
@dusenberrymw yes, supervised problems are easier to define.