Hadoop4OpenInsights

Hadoop for Open Insights
Getting Hadoop experts to chat about how the future belongs to those who know how to use data
IBM Analytics
Q6: How does Spark complement Hadoop in the open data platform?
IBM Analytics
Please post your replies here
jameskobielus
In the open data platform, Hadoop is the data-engineering "lake" (data acquisition, preparation, storage, integration, and governance) behind front-end modeling/visualization tools/languages--Spark, R, etc.--used by data scientists.
mark simmonds
Ease of use. Ad hoc analysis, speed to market, abstracts complexities. It's the killer app.
jameskobielus
Spark is the development tool of choice for data scientists developing machine learning models with streaming analytics and graph analysis for in-memory execution.
Andrew Popp
new processing power brought to Hadoop: batch and now streaming and interactive
mark simmonds
HDFS is one file system. Spark can use many different types of file systems / data platforms
jameskobielus
Hadoop, especially HDFS, is the core distributed data storage, refinement, preparation, and governance layer behind Spark.
Mark van Rijmenam
As Spark is developed on top of HDFS, it works very well with Hadoop and it can be deployed on existing Hadoop clusters or work side-by-side. I wrote a white paper about this: http://floq.to/ZiBFq
mark simmonds
But not dependent on Hadoop.
Ira Michael Blonder
Spark includes the DAG engine which permits in-memory processing. The tool set also includes SparkSQL which may be more familiar to some Data Scientists & staff
Arnab Ganguly
Spark complements Hadoop and overcomes the primary problem it has always suffered from: slow batch processing, which hampered real-time processing and streaming. Spark supports these use cases very well.
IBM Analytics
Last 4 minutes left, keep your replies coming
Chris Surdak
Spark is where #hadoop stops being a science project, or #BigBI, and starts having an actual business use.
IBM Analytics
Last 2 minutes left, keep your replies coming
mark simmonds
Spark - in-memory, z13 perfect storm?
Aniruddha Joshi
The data on Hadoop can be processed either with its MapReduce component or in memory with Spark.
Tina Groves
Agree with previous points that #spark enables more agile use of #hadoop, oriented to data scientist activities.
Chris Surdak
@zbigdata a storm that finally catches up with #appified customers' expectations!
Anil Saldanha
Spark provides a replacement for MapReduce. Faster processing. Will still rely on HDFS.
IBM Analytics
Last minute, help us wrap up!
mark simmonds
Spark and ODP - Go to Strata conf and listen to the announcements
Craig Brown, Ph.D.
Spark is an excellent contributor when MapR can't do the job or it gets too complicated. Spark is more user friendly and seems to have a little more flexibility compared to MapR.
mark simmonds
ODP and Spark - a new beginning
IBM Analytics
Q5: How do you ensure that open data is trustworthy for downstream analytics?
Mark van Rijmenam
By making sure that the right processes are in place within the organization to ensure the quality of the data
Chris Surdak
assume that it is wrong, dirty, full of stuff that makes no sense, and then start from there. Remember, ETL is for suckers.
IBM Analytics
@craigbrownphd you can post your replies here
jameskobielus
Data profiling, data cleansing, data governance...the usual processes that you've built into your logical data warehouse. You need to bring those practices completely into your open-data-platform logical data warehouse.
Arnab Ganguly
Open data must be passed through a rigorous transformation process. Thankfully, a lot of algorithms can fill in erroneous or missing data. Unlike OLTP, when it comes to large-scale analytics, error tolerance is increased significantly.
mark simmonds
Must have governance and data quality procedures in place
Ira Michael Blonder
Once again, Chris Surdak has, I think, hit it. Good approach to a reality check at the start
Olajide Oladayo Debby
By getting rid of unwanted data
mark simmonds
Don't get rid of data - manage its retention.
Andrew Popp
agree Chris S got it right .. use the process that works today (results may vary) ..
jameskobielus
You also need to flag the trustworthiness--golden vs. questionable--of all the data that you deliver to downstream analytics. E.g., if you're delivering social sentiment from zillions of Twitter users, flag it; don't oversell its quality.
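A minimal sketch of the "golden vs. questionable" flagging idea, assuming a simple per-source lookup. The `Trust` tiers, the `SOURCE_TRUST` mapping, and the source names are hypothetical and chosen only for illustration; in practice the tiers would likely come from a governance catalog maintained by data stewards.

```python
from dataclasses import dataclass
from enum import Enum


class Trust(Enum):
    GOLDEN = "golden"
    QUESTIONABLE = "questionable"


# Hypothetical mapping of source names to trust tiers (an assumption,
# not something specified in the discussion).
SOURCE_TRUST = {
    "crm": Trust.GOLDEN,
    "twitter": Trust.QUESTIONABLE,
}


@dataclass(frozen=True)
class Record:
    source: str
    payload: dict
    trust: Trust


def flag(source: str, payload: dict) -> Record:
    """Attach a trust flag; unknown sources default to QUESTIONABLE."""
    trust = SOURCE_TRUST.get(source, Trust.QUESTIONABLE)
    return Record(source, payload, trust)
```

Defaulting unknown sources to the lower tier means downstream analytics has to opt in to trusting a new feed, rather than trusting it by accident.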
mark simmonds
Nothing has changed - same rules apply as always. Just more data and different types available today.
Aniruddha Joshi
Validate the previous use of open data from external sources.
jameskobielus
You need data stewards on open data (e.g., social, geospatial, etc.) just as you should have data stewards on your "closed" (i.e., proprietary) data (e.g., customers, products, etc.). Someone must account for open data quality.
Tina Groves
All data, including open data, needs to conform to the organization's understanding of what that data represents.
mark simmonds
Regardless of quality, ownership and security issues - organizations will still use whatever they have to gain insights.
Ira Michael Blonder
Ultimately the answer to this question is complex. What may be "trustworthy" for one organization may be anything BUT "trustworthy" for the next. So IMS 2 have a #datagovernance plan in place 1st
Ira Michael Blonder
and, of course, any/all stakeholders and/or groups within the organization should be included in the process of creating the #datagovernance plan
Beate Porst
If you consider something "Open Data" you need to identify sensitive data and mask (or remove) it before making the data available as "open data".
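A rough sketch of masking or removing sensitive fields before publishing a record, under stated assumptions: the field names in `MASK_FIELDS` and `DROP_FIELDS` and the salt value are hypothetical. Note that salted hashing is pseudonymization rather than full anonymization, so it may not satisfy every privacy requirement.

```python
import hashlib

# Hypothetical field lists; a real deployment would derive these from a
# data classification policy.
MASK_FIELDS = {"name", "email"}
DROP_FIELDS = {"ssn"}


def mask_record(record: dict, salt: str = "example-salt") -> dict:
    """Return a copy of the record that is safer to publish.

    Fields in DROP_FIELDS are removed; fields in MASK_FIELDS are replaced
    by a truncated salted SHA-256 digest; everything else is copied as-is.
    """
    out = {}
    for key, value in record.items():
        if key in DROP_FIELDS:
            continue
        if key in MASK_FIELDS:
            digest = hashlib.sha256((salt + str(value)).encode("utf-8"))
            out[key] = digest.hexdigest()[:12]
        else:
            out[key] = value
    return out
```

Because the same input and salt always produce the same digest, masked values can still be joined across records without exposing the original value.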
Aniruddha Joshi
Hypothesis-driven, focused data analysis would leave you with too many outliers, I suppose.
IBM Analytics
Please look at question #6
Craig Brown, Ph.D.
Open data can only become trustworthy once it's transformed into usable data. It can be secured as part of the transformation process and then added to corporate data for analytics.
IBM Analytics
Q4: What challenges do data scientists face in sourcing and integrating open data?
IBM Analytics
Please post your replies here
Arnab Ganguly
Quality of the data and cleansing the data can be a significant challenge. Connecting to open data can also lead to significant security risks. Data-processing algorithms must be able to tolerate failure because data formats may vary.
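One way to tolerate varying formats is to try several known layouts and set aside rows that match none, instead of failing the whole load. The sketch below does this for a date column; the candidate formats in `DATE_FORMATS` and the `date` field name are assumptions made for illustration.

```python
from datetime import datetime

# Hypothetical list of layouts seen across open data sources. Order
# matters when a value could match more than one format.
DATE_FORMATS = ("%Y-%m-%d", "%d/%m/%Y", "%m-%d-%Y")


def parse_date(text):
    """Return a date for the first matching format, or None if none match."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(text.strip(), fmt).date()
        except ValueError:
            continue
    return None


def load_rows(rows):
    """Split rows into parsed and rejected instead of raising on bad input."""
    good, rejected = [], []
    for row in rows:
        parsed = parse_date(str(row.get("date", "")))
        if parsed is None:
            rejected.append(row)
        else:
            good.append({**row, "date": parsed})
    return good, rejected
```

Keeping the rejected rows, rather than silently dropping them, leaves a trail that a data steward can review later.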
Anil Saldanha
license, data scrub and correlations
jameskobielus
One key challenge that data scientists face is simply discovering which open data sets are available to them in any given subject domain, and learning what the licensing and other restrictions may be on use of that data.
Mark van Rijmenam
@connect_arnab Agree with you. Ensuring that the data is correct and can be mixed is vital if you want to gain some insights.
jameskobielus
I blogged a while back on the various levels of openness, including data discovery, in an "open data" environment: http://www.linkedin....
Ira Michael Blonder
@connect_arnab Agree with your points. Of course the credibility of the data SOURCE needs to be established, perhaps prior to even looking at a data sample
Tina Groves
Most open data sets do not have a documented data model. So, interpretation of the data and its intended use are the first challenges.
Chris Surdak
the fallacy of the #datalake. They're really #dataswamps. Integration isn't nearly so simple as "lake" implies.
jameskobielus
Here's a discussion of mine on discovering open reference graphs for big-data analytics: http://www.linkedin....
jameskobielus
Here's a discussion of mine on the need for open-data standards: http://www.linkedin....
Chris Surdak
and #bigdata really isn't about "big" or about "data;" it's about #betterquestions.
jameskobielus
Here's one I did on the challenges of monetizing big data, including open data: http://www.linkedin....
jameskobielus
One of the huge challenges in integrating open data is the bewildering variety of formats, schemas, taxonomies, and the like: getting it all to cohere to a common set of semantics, metadata, and tags.
Chris Surdak
@jameskobielus you can't monetize data or insights, you can only monetize actions. No action, no money...
IBM Analytics
Time for question #5
Ira Michael Blonder
With regards to Chris Surdak's point about "dataswamps": integration and, later, analytics, are both hobbled by a lack of tools for #OpenData comparable 2 #SQL
IBM Analytics
Please look at the top of your screen for question #5
jameskobielus
@mikethebbop Yes, indeed. Data profiling--identifying the provenance and trustworthiness of the data--can become a huge swamp of unknowables the more open data you integrate from hither and yon.
mark simmonds
Who owns the data? Do they have rights to use the data? Is it trustworthy? Same challenges as before, and then some!