#Hadoop4OpenInsights

IBM Analytics

Q6: How does Spark complement Hadoop in the open data platform?

3 Votes Vote


[-]

IBM Analytics

Please post your replies here

0 Votes Vote

[-]

jameskobielus

In the open data platform, Hadoop is the data-engineering "lake" (data acquisition, preparation, storage, integration, and governance) behind the front-end modeling and visualization tools and languages (Spark, R, etc.) used by data scientists.

3 Votes Vote

[-]

IBM Analytics

Please post your replies here

0 Votes Vote

[-]

mark simmonds

Ease of use, ad hoc analysis, speed to market, and it abstracts complexities. It's the killer app.

1 Votes Vote

[-]

jameskobielus

Spark is the development tool of choice for data scientists developing machine learning models with streaming analytics and graph analysis for in-memory execution.

2 Votes Vote

[-]

Andrew Popp

new processing power brought to Hadoop: batch and now streaming and interactive

2 Votes Vote

[-]

mark simmonds

HDFS is one file system. Spark can use many different types of file systems / data platforms

3 Votes Vote

[-]

jameskobielus

Hadoop, especially HDFS, is the core distributed data storage, refinement, preparation, and governance layer behind Spark.

1 Votes Vote

[-]

Mark van Rijmenam

As Spark is developed on top of HDFS, it works very well with Hadoop and it can be deployed on existing Hadoop clusters or work side-by-side. I wrote a white paper about this: http://floq.to/ZiBFq

4 Votes Vote

[-]

mark simmonds

But not dependent on Hadoop.

3 Votes Vote

[-]

Ira Michael Blonder

Spark includes the DAG engine which permits in-memory processing. The tool set also includes SparkSQL which may be more familiar to some Data Scientists & staff

3 Votes Vote

[-]

Arnab Ganguly

Spark complements and overcomes the primary problem that Hadoop has always suffered from - Slow batch processing which impacted real time processing/streaming. These use cases are very well supported.

5 Votes Vote

[-]

IBM Analytics

Last 4 minutes left, keep your replies coming

2 Votes Vote

[-]

Chris Surdak

Spark is where #hadoop stops being a science project, or #BigBI, and starts having an actual business use.

2 Votes Vote

[-]

IBM Analytics

Last 2 minutes left, keep your replies coming

0 Votes Vote

[-]

mark simmonds

Spark, in-memory, z13: a perfect storm?

2 Votes Vote

[-]

Aniruddha Joshi

The data on Hadoop can be processed either with its MapReduce component or in memory with Spark.

1 Votes Vote

[-]

Tina Groves

Agree with previous points that #spark enables more agile use of #hadoop, oriented to data scientist activities.

1 Votes Vote

[-]

Chris Surdak

@zbigdata a storm that finally catches up with #appified customers' expectations!

1 Votes Vote

[-]

Anil Saldanha

Spark provides a replacement for MapReduce. Faster processing. Will still rely on HDFS.

2 Votes Vote

[-]

IBM Analytics

Last minute, help us wrap up!

1 Votes Vote

[-]

mark simmonds

Spark and ODP - Go to Strata conf and listen to the announcements

1 Votes Vote

[-]

Craig Brown, Ph.D.

Spark is an excellent contributor when MapR can't do the job or it gets too complicated. Spark is more user friendly and seems to have a little more flexibility compared to MapR.

1 Votes Vote

[-]

mark simmonds

ODP and Spark - a new beginning

1 Votes Vote

IBM Analytics

Q5: How do you ensure that open data is trustworthy for downstream analytics?

3 Votes Vote


[-]

Mark van Rijmenam

By making sure that the right processes are in place within the organization to ensure the quality of the data

4 Votes Vote

[-]

Chris Surdak

assume that it is wrong, dirty, full of stuff that makes no sense, and then start from there. Remember, ETL is for suckers.

4 Votes Vote

[-]

IBM Analytics

@craigbrownphd you can post your replies here

1 Votes Vote

[-]

jameskobielus

Data profiling, data cleansing, data governance...the usual processes that you've built into your logical data warehouse. You need to bring those practices completely into your open-data-platform logical data warehouse.

2 Votes Vote
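
A minimal sketch of the data profiling step mentioned above, assuming plain Python dictionaries as rows. The function name `profile_missing` and the sample rows are hypothetical and only illustrate measuring how much of each field is missing before trusting a data set:

```python
from collections import Counter

def profile_missing(rows):
    """Return the share of rows in which each field is absent, None, or empty."""
    fields = set()
    for row in rows:
        fields.update(row)

    missing = Counter()
    for row in rows:
        for field in fields:
            # A field counts as missing if it is absent, None, or an empty string.
            if row.get(field) in (None, ""):
                missing[field] += 1

    return {field: missing[field] / len(rows) for field in sorted(fields)}

# Hypothetical sample rows from an open data set.
rows = [
    {"id": 1, "city": "Austin"},
    {"id": 2, "city": ""},
    {"id": 3},
    {"id": 4, "city": None},
]
report = profile_missing(rows)
```

A report like this is only a starting point; cleansing and governance rules would decide what to do with fields that are mostly empty.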

[-]

Arnab Ganguly

Open data must be passed through a rigorous transformation process. Thankfully, a lot of algorithms can fill in erroneous or missing data. Unlike OLTP, when it comes to large-scale analytics, error tolerance is increased significantly.

2 Votes Vote

[-]

mark simmonds

Must have governance and data quality procedures in place

2 Votes Vote

[-]

Ira Michael Blonder

Once again, Chris Surdak has, I think, hit it. Good approach to a reality check at the start

0 Votes Vote

[-]

Olajide Oladayo Debby

By getting rid of unwanted data

0 Votes Vote

[-]

mark simmonds

Don't get rid of data - manage its retention.

1 Votes Vote

[-]

Andrew Popp

agree Chris S got it right .. use the process that works today (results may vary) ..

1 Votes Vote

[-]

jameskobielus

You also need to flag trustworthiness--golden vs. questionable--of all the data that you deliver to downstream analytics--e.g., if you're delivering social sentiment from zillions of Twitter users, flag it, don't oversell its quality

0 Votes Vote
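
A minimal sketch of the "golden vs. questionable" flagging idea above. The source names, the `Record` class, and the trust labels are illustrative assumptions, not anything specified in the chat:

```python
from dataclasses import dataclass

# Hypothetical mapping from data source to trust tier.
SOURCE_TRUST = {
    "crm": "golden",
    "twitter": "questionable",
}

@dataclass
class Record:
    source: str
    payload: dict
    trust: str = "unknown"

def flag_trust(records):
    """Attach a trust label to each record based on where it came from."""
    for record in records:
        record.trust = SOURCE_TRUST.get(record.source, "unknown")
    return records

records = flag_trust([
    Record("crm", {"customer_id": 42}),
    Record("twitter", {"sentiment": "positive"}),
    Record("forum", {"sentiment": "negative"}),
])
labels = [r.trust for r in records]
```

Carrying the label with each record lets downstream analytics filter or weight by trust instead of treating all sources alike.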

[-]

mark simmonds

Nothing has changed - same rules apply as always. Just more data and different types available today.

0 Votes Vote

[-]

Aniruddha Joshi

Validate the previous use of open data from external sources.

1 Votes Vote

[-]

jameskobielus

You need data stewards on open data (e.g., social, geospatial, etc.) just as you should have data stewards on your "closed" (i.e., proprietary) data (e.g., customers, products, etc.). Someone must account for open data quality.

0 Votes Vote

[-]

Tina Groves

All data, including open data, needs to conform to the organization's understanding of what that data represents.

0 Votes Vote

[-]

mark simmonds

Regardless of quality, ownership, and security issues, organizations will still use whatever they have to gain insights.

0 Votes Vote

[-]

Ira Michael Blonder

Ultimately the answer to this question is complex. What may be "trustworthy" for one organization may be anything BUT "trustworthy" for the next. So IMS 2 have a #datagovernance plan in place 1st

1 Votes Vote

[-]

Ira Michael Blonder

and, of course, any/all stakeholders and/or groups within the organization should be included in the process of creating the #datagovernance plan

0 Votes Vote

[-]

Beate Porst

If you consider something "open data", you need to identify sensitive data and mask (or remove) it before making the data available as "open data".

0 Votes Vote
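
A minimal sketch of the masking step described above, assuming records are Python dictionaries. The field list, the email pattern, and the `mask_record` helper are hypothetical; real sensitive-data detection would need far broader rules:

```python
import re

# Assumed set of field names treated as sensitive; adjust per data set.
SENSITIVE_FIELDS = {"name", "email"}

# Simple illustrative pattern for email addresses embedded in free text.
EMAIL_PATTERN = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")

def mask_record(record):
    """Return a copy of the record with sensitive values replaced by '***'."""
    masked = {}
    for key, value in record.items():
        if key in SENSITIVE_FIELDS:
            masked[key] = "***"
        elif isinstance(value, str):
            # Also scrub email addresses that leak into other text fields.
            masked[key] = EMAIL_PATTERN.sub("***", value)
        else:
            masked[key] = value
    return masked

row = {
    "name": "Jane Doe",
    "email": "jane@example.com",
    "city": "Austin",
    "note": "contact jane@example.com",
    "visits": 3,
}
published = mask_record(row)
```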

[-]

Aniruddha Joshi

Hypothesis-driven, focused data analysis would leave you with too many outliers, I suppose.

0 Votes Vote

[-]

IBM Analytics

Please look at question #6

0 Votes Vote

[-]

Craig Brown, Ph.D.

Open data can only become trustworthy once it's transformed into usable data. It can be secured as a part of the transformation process and then added to corporate data for analytics.

1 Votes Vote

IBM Analytics

Q4: What challenges do data scientists face in sourcing and integrating open data?

1 Votes Vote


[-]

IBM Analytics

Please post your replies here

1 Votes Vote

[-]

Arnab Ganguly

Quality of the data and cleansing the data can be a significant challenge. Connecting to open data can also lead to significant security risks. Data processing algorithms must be able to tolerate failure because data formats may vary.

2 Votes Vote

[-]

Anil Saldanha

License, data scrub, and correlations

0 Votes Vote

[-]

jameskobielus

One key challenge that data scientists face is simply discovering which open data sets are available to them in any given subject domain, and learning what the licensing and other restrictions may be on use of that data.

1 Votes Vote

[-]

Mark van Rijmenam

@connect_arnab Agree with you. Ensuring that the data is correct and can be mixed is vital if you want to gain some insights.

2 Votes Vote

[-]

jameskobielus

I blogged a while back on the various levels of openness, including data discovery, in an "open data" environment: http://www.linkedin....

0 Votes Vote

[-]

Ira Michael Blonder

@connect_arnab Agree with your points. Of course the credibility of the data SOURCE needs to be established, perhaps prior to even looking at a data sample

2 Votes Vote

[-]

Tina Groves

Most open data sets do not have a documented data model. So, interpretation of the data and its intended use are the first challenges.

1 Votes Vote

[-]

Chris Surdak

the fallacy of the #datalake. They're really #dataswamps. Integration isn't nearly so simple as "lake" implies.

2 Votes Vote

[-]

jameskobielus

Here's a discussion of mine on discovering open reference graphs for big-data analytics: http://www.linkedin....

1 Votes Vote

[-]

jameskobielus

Here's a discussion of mine on the need for open-data standards: http://www.linkedin....

0 Votes Vote

[-]

Chris Surdak

and #bigdata really isn't about "big" or about "data;" it's about #betterquestions.

0 Votes Vote

[-]

jameskobielus

Here's one I did on the challenges of monetizing big data, including open data: http://www.linkedin....

0 Votes Vote

[-]

jameskobielus

One of the huge challenges in integrating open data is the bewildering variety of formats, schemas, taxonomies, and the like: getting it all to cohere to a common set of semantics, metadata, and tags.

0 Votes Vote
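
A minimal sketch of the schema-harmonization challenge above: mapping differently named fields from two open data sets onto one common vocabulary. The alias table and the `harmonize` helper are hypothetical; real integration would also need type, unit, and taxonomy alignment:

```python
# Assumed mapping from source-specific field names to a common schema.
FIELD_ALIASES = {
    "zip": "postal_code",
    "zipcode": "postal_code",
    "postcode": "postal_code",
    "lat": "latitude",
    "lon": "longitude",
    "lng": "longitude",
}

def harmonize(record):
    """Rename fields to the common schema; unknown fields are only lowercased."""
    out = {}
    for key, value in record.items():
        normalized = key.strip().lower()
        out[FIELD_ALIASES.get(normalized, normalized)] = value
    return out

# Two hypothetical sources describing the same kind of entity.
a = harmonize({"ZIP": "73301", "Lat": 30.3, "Lng": -97.7})
b = harmonize({"postcode": "SW1A 1AA", "latitude": 51.5, "lon": -0.14})
```

After harmonization both records expose the same field names, so they can be combined or compared directly.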

[-]

Chris Surdak

@jameskobielus you can't monetize data or insights, you can only monetize actions. No action, no money...

Time for question #5

With regard to Chris Surdak's point about "dataswamps": integration and, later, analytics are both hobbled by a lack of tools for #OpenData comparable to #SQL

0 Votes Vote

[-]

IBM Analytics

Please look at the top of your screen for question #5

0 Votes Vote

[-]

jameskobielus

@mikethebbop Yes, indeed. Data profiling--identifying the provenance and trustworthiness of the data--can become a huge swamp of unknowables the more open data you integrate from hither and yon.

0 Votes Vote

[-]

mark simmonds

Who owns the data? Do they have rights to use the data? Is it trustworthy? Same challenges as before, and then some!

0 Votes Vote
