More than a numbers game: the Open Data Science Conference in Boston


 

Is data science the "sexiest job on the planet"? Josh Wills from big data solutions provider Cloudera wants you to believe it isn't, so that not everybody tries to jump on that train. Or will data scientists "go extinct" in the near future, because ever smarter tools will automate the expert-level tasks currently carried out by humans? Alex Cosmas from technology consulting firm Booz Allen Hamilton raises this question, alluding to a recent poll by KDnuggets.

 
 
Boston Convention Center

 
 

These two hyperboles from the opening keynotes of the first Open Data Science Conference in Boston (May 30-31, 2015) give an impression of how far the field of data science has already advanced, at least in the US. About 1,250 working and aspiring data scientists, including some of the most prominent figures in the field, gathered on this hot and humid early summer weekend to discuss and advance the use of open source in data science.

But what's so special about data scientists, anyway? Again quoting Josh Wills: "a data scientist can (1) ask great questions and (2) answer them faster than anyone else." In an information economy, these skills are in high demand. And there are the paychecks to prove it.

The conference gave a good overview of the variety of domains in which data science is being applied: from enabling state-of-the-art business journalism at Dow Jones to transitioning from reactive to proactive health care, and from metrics-driven development at Spotify to exposing abuses of power in the public and private sectors.

 
"Data scientist at home" (adapted from Josh Wills's presentation)

 
 

As the focus of the conference was on open source, quite a few talks covered the current landscape of open source tools for data science. Wes McKinney, creator of the widely used Python data analysis library pandas, pointed out the importance of universally accessible data formats. And Gael Varoquaux and Andreas Müller, co-authors of the Python machine learning toolbox scikit-learn, gave insights into the open-source development process.

Having struggled with the shortcomings of classical statistical significance testing in my own scientific career, I was particularly excited about Allen Downey's talk on Bayesian thinking. (Check out his slides and his free e-book Think Bayes.) In a very clear and entertaining manner, he debunked the myths surrounding Bayesian statistics, making the point that Bayesian methods are much more accessible than commonly believed, especially when taking a computational approach. More importantly, Bayesian methods help you answer better questions than classical p-value significance tests do.
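To give a flavour of what a computational approach can look like, here is a minimal sketch of my own (it is not taken from Downey's talk or book). It assumes a made-up series of coin flips, a uniform prior, and a grid of 101 candidate values for the coin's probability of heads; the function name `bayes_update` is purely illustrative. Instead of asking "is the result significant?", it asks "how probable is it that the coin favours heads, given the data?"

```python
def bayes_update(prior, likelihoods):
    """Multiply prior by likelihood for each hypothesis, then normalise."""
    unnormalised = {h: prior[h] * likelihoods[h] for h in prior}
    total = sum(unnormalised.values())
    return {h: p / total for h, p in unnormalised.items()}


# Hypotheses: candidate values for the probability of heads (assumed grid).
hypotheses = [i / 100 for i in range(101)]

# Uniform prior: every candidate value is equally plausible before seeing data.
posterior = {h: 1 / len(hypotheses) for h in hypotheses}

# Made-up observations for illustration: 8 heads, 2 tails.
observations = "HHTHHHTHHH"

# Update the distribution one flip at a time.
for flip in observations:
    likelihoods = {h: (h if flip == "H" else 1 - h) for h in hypotheses}
    posterior = bayes_update(posterior, likelihoods)

# Summaries that answer the question directly.
posterior_mean = sum(h * p for h, p in posterior.items())
prob_favours_heads = sum(p for h, p in posterior.items() if h > 0.5)

print(f"Posterior mean of P(heads): {posterior_mean:.3f}")
print(f"P(coin favours heads | data): {prob_favours_heads:.3f}")
```

The whole method is multiplication and normalisation over a list of hypotheses, which is why no calculus is needed. A finer grid or a different prior only changes the two lines that set up `hypotheses` and `posterior`.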

 
 
Frequentist statistics Venn diagram (adapted from Allen Downey’s presentation and Ted Bunn's blog)

 
 

Another personal highlight was the workshop on data visualisation and user experience by Bang Wong and Mark Schindler (see slides). Wong stressed the importance of taking the idiosyncrasies of human visual perception into account when visualising data. Colour gradients can be particularly difficult for the human eye to discern, as our perception relies heavily on context and contrast. Gestalt principles such as proximity and similarity also strongly influence how we interpret the data. Schindler laid out his user experience framework for data visualisation. According to his model, a visualisation should always address the user's goals. Insights emerge from the conversation between the questions asked by the user and the narrative told by the data.

So, is data science "sexy"? Well, there's obviously a high demand for smart people who can extract insights from big, messy piles of data, and pretty visualisations can even make those piles look good. Also, there was an encouraging share of women among the conference attendees, proving that you most certainly don't need to be a guy to be good with numbers.

But will data scientists "go extinct"? Not if they can get over their dependency on data, putting more weight on the "science" in their job description. Scientists cannot simply be data collectors and tool handlers. They need to come up with the right questions, keep inventing new methods to tackle them and make sense of the results. Or, as Alex Cosmas put it: "It's a badge of honour to be a scientist. We have to be truth seekers, not fact seekers."

 
 
 
