A Gentle Introduction to Data Science

The words "Data Science" are not, in themselves, sources of dread for most people. At just four and seven letters, respectively, they're almost too cute to be really off-putting, unlike some of the other terms you come across when you begin digging into the field: terms like "k-nearest neighbors" or "tessellation."

And if you can hear the phrase "Euclidean minimum spanning tree" without feeling as though you've encountered something both bizarrely fascinating and deeply disturbing, you are a stronger intellectual force than I.


But while "Data Science" as a phrase may not inspire fear, neither does it really convey anything about what the field of Data Science actually does. I mean, everybody kind of knows what data is, at least in a loose sense. Data is raw bits of information. And "science" can be used to mean any of a group of activities following the scientific method. So, just linguistically, we can figure out that data science is going to be a field of study focused on applying the scientific method to bits of information. But still, data science is what, exactly?


Wikipedia, that font of all truth, makes things clearer by defining data science as a field focused on "extracting knowledge or insights from data." Here's what Wikipedia doesn't say, though: You are a natural data scientist.

In every waking moment of your life, you're observing the world around you. You're processing those observations into data, and using that data to understand your surroundings, infer meaning, and predict what will happen next.

When you're 30 minutes late getting ready to leave for work, and you call in to tell them you'll be working from home, you're using past observations of traffic patterns to predict that you'd lose more productive time stuck in traffic than you'd gain by being in the office. When you come into the living room to find the floor littered with small bits of silver foil and strips of paper, causal analysis will tell you the kids found your stash of Hershey's Kisses.

In either case, if you do those calculations in your head without really thinking about them, you're a normal human being. If, on the other hand, you record those data points (often in some machine-readable format) and then devise mathematical algorithms (aka procedures) and a set of computer programs to run them for you, the output of which tells you traffic will most likely suck and the kids ate your candy, then you are a Data Scientist. Good on ya! Add it to your LinkedIn!


So, now that you know what data science is, what can you do with it? The answer to that largely depends on what data you have and what you want to know. Textbooks in the field, like Peng and Matsui's excellent The Art of Data Science, will tell you to first determine the question you want to answer and then explore the data you have available to you. For professional analysts and data scientists, I'd agree with that. For those new to data science, though, I recommend instead starting by identifying the data you have available, doing some exploratory analysis of that data, and letting that lead you to further questions.

Even the most minimal websites are treasure troves of data. For example, sentiment analysis can expose the general mood of the pieces of content on your site. Combine that information with basic site usage data provided by Google Analytics (or similar tools), and you can see which content mood your users find most appealing and begin to shape your content to fit that mood. The same site usage information can be compared against news and events in the broader world to reveal how outside factors might influence how users interact with the site. For instance, if your old blog post about how to immigrate to Canada saw increased traffic during the last two US presidential elections, the run-up to the next one might be a good time to feature it more prominently.
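To make the mood-versus-traffic idea a little more concrete, here is a minimal sketch in Python. Everything in it is an assumption for illustration: the page paths, the sentiment scores (imagined as already produced by whatever sentiment tool you like, on a scale from -1 to 1), the page view counts, and the 0.2 cutoff for calling a page "positive" or "negative" are all made up. It simply buckets pages by mood and averages the views in each bucket.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical data: (page, sentiment score in [-1, 1], monthly page views).
# In practice the scores would come from a sentiment tool and the views
# from your analytics export; these numbers are invented.
pages = [
    ("/blog/moving-to-canada", -0.4, 1200),
    ("/blog/summer-recipes", 0.7, 300),
    ("/blog/tax-deadlines", -0.6, 900),
    ("/blog/puppy-photos", 0.9, 500),
    ("/blog/site-news", 0.0, 100),
]


def mood_label(score, threshold=0.2):
    """Turn a numeric sentiment score into a coarse mood label."""
    if score > threshold:
        return "positive"
    if score < -threshold:
        return "negative"
    return "neutral"


def average_views_by_mood(records):
    """Group pages by mood label and average the page views in each group."""
    views = defaultdict(list)
    for _page, score, page_views in records:
        views[mood_label(score)].append(page_views)
    return {mood: mean(counts) for mood, counts in views.items()}


result = average_views_by_mood(pages)
for mood, avg in sorted(result.items()):
    print(f"{mood}: {avg} average views")
```

With only five invented pages this proves nothing, of course. On a real site you would want far more pages per bucket before drawing any conclusion about what your readers prefer.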

Commerce sites, which have greater user interactivity, are even more bountiful in terms of the data they provide. For instance, who are your top purchasers? What categories of things is each user buying and when? If user 976 buys toys for girls age 9-11 every year in June, maybe you want to send them a mailer with offerings in that category in mid-May.
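As a rough sketch of how you might spot a buyer like user 976, the Python below looks for user, category, and month combinations that repeat across years. The purchase log, the category names, and the "at least two distinct years" rule are all hypothetical, chosen only to illustrate the idea; a real store would pull these records from its order database.

```python
from collections import defaultdict
from datetime import date

# Hypothetical purchase log: (user_id, category, purchase_date).
purchases = [
    (976, "toys-girls-9-11", date(2021, 6, 12)),
    (976, "toys-girls-9-11", date(2022, 6, 3)),
    (976, "toys-girls-9-11", date(2023, 6, 20)),
    (976, "kitchen", date(2022, 11, 25)),
    (412, "books", date(2023, 1, 5)),
    (412, "books", date(2023, 3, 9)),
]


def recurring_purchases(records, min_years=2):
    """Find (user, category, month) patterns seen in at least min_years distinct years."""
    years_seen = defaultdict(set)
    for user_id, category, when in records:
        years_seen[(user_id, category, when.month)].add(when.year)
    return {
        key: sorted(years)
        for key, years in years_seen.items()
        if len(years) >= min_years
    }


def mailer_month(purchase_month):
    """Month before the usual purchase month, wrapping January back to December."""
    return 12 if purchase_month == 1 else purchase_month - 1


patterns = recurring_purchases(purchases)
for (user_id, category, month), years in patterns.items():
    print(
        f"user {user_id} buys {category} in month {month} "
        f"(years {years}); send a mailer in month {mailer_month(month)}"
    )
```

Here only user 976's June toy purchases qualify, since the other purchases never repeat in the same month of a different year. Sending the mailer one month ahead is just one possible choice; "mid-May" in the example above would mean scheduling it a few weeks before the usual purchase date.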

Data is everywhere. All you need to put it to use is a little curiosity, some freely available tools like R and SciPy, and a bit of reading. Find the data you have, let it lead you to the questions, and then science it until it reveals the answers!