Daffodil International University

Faculty of Science and Information Technology => Software Engineering => Topic started by: Asif Khan Shakir on February 20, 2020, 08:18:04 PM

Title: The Data Science Puzzle — 2020 Edition
Post by: Asif Khan Shakir on February 20, 2020, 08:18:04 PM
With a new year upon us, let's take a fresh look at the current state of the data science puzzle. What are the most important constituent concepts of the data science landscape? How do they fit together? Which of these have been elevated in importance since the previous installment, and which are less important?

As a few years have passed since I last treated this particular topic, it might be worth having a look at this out of interest, and for comparison. We will proceed by first looking at the concept definitions from last time, and then look at how things have changed since then.

We start with the perceived original driver of the data science revolution, big data. What I said in 2017:

Big Data is still important to data science. Take your pick of metaphors, but any way you look at it, Big Data is the raw material that [...] continues to fuel the data science revolution.

As relates to Big Data, I believe that justification of data-acquisition and -retention from a business point of view, expectations that Big Data projects start providing actual financial returns, and the challenges related to data privacy and security will become the big Big Data stories not only of 2017 but moving forward in general. In short, it's time for big returns from, and big protections for, Big Data.

However, as others have opined, Big Data now "just is," and is perhaps no longer an entity deserving of the special attention it has received for the better part of a decade.

While I don't condone the capitalization of most key terms in general, "big data" previously seemed to demand this treatment given its near-fabled status and brand name-like station. Notice that this time around I have revoked this status, which goes hand in hand with the idea that big data is no longer top-level data science terminology. As alluded to in the final sentence of the excerpt, moving forward big data is simply "data," and we could reword part of that excerpt to read, "data is the raw material that continues to fuel the data science revolution."

Look, at this point we should all be aware of how important data is to the process of data science (it's right there in the name). Whether our data is big or small or lies somewhere else on the data sizing spectrum really doesn't require distinguishing from the outset. We all want to science the data and provide value, whether the data is a lot or a little. "Big data" may provide us with more or unique opportunities for the types of analytics and modeling to employ, but this seems akin to distinguishing the size of our nails from the get-go just so we know what size and type of hammer to bring along for a given job.

Data is everywhere. Much of it is big. It's time we stop emphasizing this, just like it's time we stop saying "smart" phone. The phones are all basically smart now, and making special note of it really says more about you than it does about the phone.

One thing I stand by, however, is that the challenges related to data privacy and security will only grow in importance as the years march on, and we can add ethics into that mix as well, though seriously treating these topics is beyond the scope of this article.

Here's what I said about machine learning as a component of data science last time:

Machine learning is one of the primary technical drivers of data science. The goal of data science is to extract insight from data, and machine learning is the engine which allows this process to be automated. Machine learning algorithms continue to facilitate the automatic improvement of computer programs from experience, and these algorithms are becoming increasingly vital to a variety of diverse fields.

I stand by this, and would only make the argument that machine learning is more than one of the primary technical drivers of data extraction; it is the primary technical driver.

There are a variety of aspects to data science; we are discussing a number of them in this very article. However, when thinking about extracting insight from data which cannot be seen with the "naked eye" via descriptive statistics or the visualization of these stats or some type of business intelligence reporting — all of which can be very useful and provide invaluable illumination in the proper circumstance — machine learning is the natural path to take, a path which has automation baked in.

Machine learning is not synonymous with data science; however, given the reliance on machine learning to extract insight from data, you can forgive the many who often make this mistake.