SpeedIdeation

9/15 CSC #Speedideation
Topic today is Software Engineering in #DataScience: http://wp.me/p4SqGB-3u
   10 years ago
Jerry Overton
What software engineering practices do we need in #DataScience?
Theyaa Matti
Current agile software development practices include standard naming, commenting, and associated tests.
Kyle Zellman
Clear variable naming, looping/control flow, readability/reproducibility
Chris Fangmann
@khzellman Any ideas on standards / best practices?
Henry Helgen
@TheyaaMatti coding standards, naming conventions, test planning are essential software quality steps with any approach.
Kyle Zellman
@ChrisFangmann when it comes to naming--clear, descriptive names that allow others to understand what the code is doing. Readability/reproducibility--robust comments, using spacing and indenting to your advantage.
Fabien Gelineau
The ability to gracefully mix several programming paradigms: procedural, object-oriented, and functional programming, all tied together
Jerry Overton
You'd have to know some really flexible architecture patterns for that, right? Any suggestions?
Soren Helsted
Does anybody have examples of good patterns and anti-patterns, and are patterns important in #DataScience programming?
Chris Baker
Test-driven development is important in any context, no less so for data science. Understanding of how an algorithm works doesn't help if you've coded it incorrectly.
Jerry Overton
@ScaredOfGeese But TDD is fairly new in #DataScience, right? How many of us have experience with unit testing in Python or R?
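For anyone who has not tried unit testing in Python, here is a minimal sketch using the standard-library `unittest` module. The function `mean_absolute_error` and its test cases are hypothetical, invented for illustration; they do not come from anyone in this chat.

```python
import unittest


def mean_absolute_error(actual, predicted):
    """Average absolute difference between two equal-length sequences."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    if not actual:
        raise ValueError("inputs must not be empty")
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


class MeanAbsoluteErrorTest(unittest.TestCase):
    """Each test documents one expectation about the function."""

    def test_perfect_prediction_has_zero_error(self):
        self.assertEqual(mean_absolute_error([1, 2, 3], [1, 2, 3]), 0)

    def test_known_value(self):
        # |1-2| + |2-2| + |3-5| = 3, divided by 3 observations
        self.assertAlmostEqual(mean_absolute_error([1, 2, 3], [2, 2, 5]), 1.0)

    def test_length_mismatch_is_rejected(self):
        with self.assertRaises(ValueError):
            mean_absolute_error([1, 2], [1])
```

Saved in a file, the tests can be run with `python -m unittest`. In a test-first workflow the three test methods would be written before the function body.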
Henry Helgen
Commenting for re-usability. Header comment with purpose, change log, inputs, outputs, exceptions, how to call. Block comments that answer "Why did you write it that way?"
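A header comment along these lines might look like the following in Python. This is only a sketch: the function `min_max_scale`, the change-log entry, and the layout of the header are assumptions made for illustration, not an established standard.

```python
"""
Purpose:    Rescale a list of numbers to the range [0, 1] before modeling.
Change log: (hypothetical entry) initial version.
Inputs:     values - non-empty list of ints or floats.
Outputs:    list of floats, same length and order as the input.
Exceptions: ValueError if `values` is empty (raised by min()).
How to call:
    scaled = min_max_scale([2, 4, 6])
"""


def min_max_scale(values):
    """Return `values` linearly rescaled so min maps to 0.0 and max to 1.0."""
    low, high = min(values), max(values)

    # Why: a constant column has no spread, so dividing by (high - low)
    # would raise ZeroDivisionError. Returning zeros keeps the output
    # length stable for downstream code.
    if high == low:
        return [0.0 for _ in values]

    return [(v - low) / (high - low) for v in values]
```

The block comment explains the reason for the special case rather than restating what the code does, which is the "why" part of the advice.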
Chris Baker
These assumptions should be explicitly stated and verified using tests that document and demonstrate the proper functionality of constituent components.
Sorin Costea
let's not forget that TDD is no silver bullet either
Jerry Overton
What about the practice of writing comments first? Too simple?
Soren Helsted
Like when we did pseudo code in comments before adding the real code?
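A small sketch of the comments-first idea: the numbered comments are written before any code, as pseudo code, and each one is then filled in with the real statements. The function `summarize_column` and its behavior are hypothetical examples.

```python
def summarize_column(values):
    """Return basic statistics for a column that may contain None."""
    # 1. Drop missing entries (None) so they do not distort the statistics.
    present = [v for v in values if v is not None]

    # 2. Fail loudly if nothing is left to summarize.
    if not present:
        raise ValueError("no non-missing values to summarize")

    # 3. Compute count, mean, min and max in one place.
    return {
        "count": len(present),
        "mean": sum(present) / len(present),
        "min": min(present),
        "max": max(present),
    }
```

The comments survive as documentation of intent once the code is in place.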
Jerry Overton
Do data scientists really need to know how to code?
Theyaa Matti
I believe it is necessary, especially if you are to share your code with colleagues or outside communities
Chris Fangmann
That's a big YES for me - question is WHAT? and HOW?
Kyle Zellman
Yes, absolutely. I've found that my data wrangling/cleaning skills get better as I get better at coding, but I still have a ways to go.
Faisal Siddiqi
I'm convinced that data science has a fairly significant component of coding. I like Drew Conway's Venn diagram as a broad picture http://bit.ly/1fDSjp...
Soren Helsted
How about knowing how to code effectively, to get your results quickly?
Theyaa Matti
In terms of what to learn, it depends on what you are trying to accomplish. For #DataScience you will be looking at programming languages that have statistical algorithms built in.
Faisal Siddiqi
we typically write code to consume data sources available as public/private APIs, mash them up, analyze and visualize
Theyaa Matti
One other thing, code reusable components
Lisa Braun
@Faisal_Siddiqi From Drew Conway's Venn diagram blog post: "Being able to manipulate text files at the command-line, understanding vectorized operations, thinking algorithmically; these are the hacking skills that make for a successful data hacker."
Faisal Siddiqi
@LisaAnneBraun although "hacking" may have a negative connotation, I agree that iterative, experimental coding is a useful technique
Lisa Braun
@Faisal_Siddiqi Right - in the blog post "hacking" was not negative but used to mean able to manipulate code cleverly, whether or not you have formal CS training. When data is a commodity, you need hacking skills to unlock insights. (my interpretation)
Jerry Overton
Great work on the topic. Can anyone suggest a class, book, blog, etc. that does a great job on the topic of SW Engineering in #DataScience?
Chris Fangmann
Should we create a central site where all can follow up on this question and keep adding info on classes, books, blogs etc?
Faisal Siddiqi
Data Science course on Coursera: https://class.course... Over half the material relates to software engineering practices.
Jerry Overton
@ChrisFangmann Definitely. I'd like to start by summarizing all the great comments we're getting here.
Chris Fangmann
Deal - I'll set this up tonight
Kyle Zellman
I'd love this. Looking for good resources in this area.
Henry Helgen
@ChrisFangmann Take a look at https://c3.csc.com/g...
Chris Fangmann
@HenryHelgen I will still open an external site so partners and other interested people can contribute and have access
Jerry Overton
@ChrisFangmann Definitely. This is outside-in, baby!
Chris Baker
Not specific to #DataScience, but this is a list compiled by a colleague for computational scientists and engineers: http://web.ornl.gov/... Probably the most important of these is "Code Complete"
Chris Fangmann
@ScaredOfGeese Excellent - thanks for sharing
Jerry Overton
@ScaredOfGeese Code Complete is an absolute classic. It's how I learned SW Engineering. But isn't it too heavy for data science?
Chris Baker
Maybe? Isn't the point to improve the data science curriculum?
Jerry Overton
Libraries, functions, or just to heck with it...how do you organize #MachineLearning code?
Steven Melanson
Libraries, functions, modules, etc. All incredibly important in a machine learning problem!
Chris Baker
If you don't code in libraries (of functions) to begin with, you'll probably have to refactor everything in order to distribute it and test it.
Chris Fangmann
How could one organize a global repository of libraries and functions?
Theyaa Matti
If you create a function to perform an operation and then think it will help you in another program, create a module and make that function as general as possible.
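As a rough illustration of generalizing a function for reuse: instead of hard-coding one column name and one fallback value, both become parameters. The module name `cleaning.py`, the function `parse_numeric`, and the row layout (a list of dicts) are all assumptions made for this sketch.

```python
# Contents of a hypothetical reusable module, e.g. cleaning.py


def parse_numeric(rows, column, default=None):
    """Return the values of `column` as floats, one per row.

    rows    - iterable of dicts (for example, rows read from a CSV file)
    column  - key to extract from each row
    default - value used when the key is missing or cannot be parsed
    """
    parsed = []
    for row in rows:
        try:
            parsed.append(float(row[column]))
        except (KeyError, TypeError, ValueError):
            # Missing key, None, or non-numeric text: fall back to default.
            parsed.append(default)
    return parsed
```

Another program could then import it and call, for example, `parse_numeric(rows, "age", default=0.0)` without copying the parsing logic.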
Steven Melanson
I would add in extendable classes as well
Jerry Overton
@ScaredOfGeese Exactly! And I'm not sure that these constructs are a part of the standard #DataScience education. Thoughts?
Theyaa Matti
All the libraries you use in R/Python are groups of functions
Faisal Siddiqi
Good software engineering practices emphasize reusing existing third-party code, OSS, etc. to minimize custom code. This is clearly why RStudio and Python are so useful for data science: lots of powerful libs
Fabien Gelineau
@ChrisFangmann the question about organization is relevant but far less important than questions about library and function documentation and samples ... this is the key to getting into action in minutes ...
Chris Fangmann
Is there a need to change #DataScience education? Add global OpenSource tools / libraries / best practices to it?
Jerry Overton
@ChrisFangmann It's a shame I can only vote for this once. I think that's a definite yes!
Chris Fangmann
Does this also mean IT companies need to be more active in building an educational foundation?
Fabien Gelineau
@all classes (OO) or closures - don't we need both, as each of them brings advantages and drawbacks?
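To make the comparison concrete, here is a minimal sketch of the same running-mean calculation written once as a closure and once as a class. The names `make_running_mean` and `RunningMean` are hypothetical, chosen only for this example.

```python
def make_running_mean():
    """Closure version: state lives in the enclosing function's variables."""
    total = 0.0
    count = 0

    def update(value):
        nonlocal total, count
        total += value
        count += 1
        return total / count

    return update


class RunningMean:
    """Class version: state lives in attributes that callers can inspect."""

    def __init__(self):
        self.total = 0.0
        self.count = 0

    def update(self, value):
        self.total += value
        self.count += 1
        return self.total / self.count
```

The closure is compact and keeps its state hidden, which suits small functional pipelines. The class exposes its state (`total`, `count`) and can be extended through subclassing, at the cost of a little more ceremony.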
Steven Melanson
Documentation is something that can be overlooked in projects but can end up being one of the most useful pieces down the road
Jerry Overton
@ChrisFangmann I think that the "computational thinkers" need a more active voice in #DataScience education, for sure.
Soren Helsted
@ChrisFangmann which education do #DataScience people have and is it harmonized globally? Sounds like most could do with an extra class in SW Engineering practices.
Theyaa Matti
I believe incorporating SW engineering practices into #DataScience would help improve the field and how shared code is used within the community.
Chris Fangmann
Fully with you here - for me this means a more pro-active engagement in schools & universities as well as in cross-partnership education
Sorin Costea
@ChrisFangmann however, school engagement is only a long-term perspective... what is the short term?
Chris Fangmann
@sorincos create awareness of teams like the people here in this chat, get together, get started
Jerry Overton
@sorincos Actually, I think portfolio-based engagement or micro-tasks is both the short- and long-term solution.