Categories
Motivation

System for Promoting Healthy Habits

I’m a goal-oriented person. When I started this blog about 2 months ago, I set a simple goal of growing the domain authority of Skillenai. To achieve my goal, I set about writing articles on Skillenai and other sites.

I got good feedback on my first few articles and landed some nice guest blog opportunities. I kept writing at a brisk pace for the first month in pursuit of my goal. I wrote 14 articles in 30 days.

Then I burnt out. I wrote 0 articles in the past 30 days.

Bursts of Passion

Goal-setting promotes bursts of passion that lead to burnout. The story of my first month writing for Skillenai is not unusual in my life. I could tell a dozen similar stories from my past.

How often have you fallen into this pattern: you set a goal that sparks some passion, you have a burst of motivation, then you’re left exhausted when you reach the goal (if you even make it that far)? For me, the answer is “too many to count.”

I’ve come to believe that this pattern not only yields poor results but is also bad for mental health.

Consistent Behaviors

While bursts of passion peter out, consistent behaviors fuel exponential growth. Lately I’ve been noticing examples of this everywhere.

  • All the most successful bloggers, youtubers, and influencers preach about the importance of consistency.
  • Small bi-weekly 401k contributions can grow to over $1M over the course of your career. Meanwhile, day traders are more likely to go bankrupt than ever turn a profit.
  • I’ve been watering my “money tree” every weekend for 3 years. It went from a little pot on my window sill to this.
I’ve been watering my “money tree” every weekend for 3 years.

Tool for Promoting Healthy Habits

Other than watering my money tree, I don’t have a lot of consistent behaviors in my life. To help me change that, I built a simple tool on Skillenai. The tool is free, but you need to register before using it. Click the button below to try it out.

Here’s how it works.

  1. Missions – Start by adding a mission to group together your new healthy habits around a common purpose.
    • For me, my first mission is this: “Do work that is challenging and fulfilling with exponentially growing rewards and impact.”
  2. Habits – Then add some habits to support your mission. Each habit has a set frequency such as weekly, monthly, quarterly, or yearly.
    • One of my habits is this: “Write an article for the Skillenai blog once per week.”
  3. Activities – Use the activities calendar to track how consistently you follow through on your new habits.

This tool has motivated me to finally write a new article (this one) after a hiatus of over a month. And if all goes well, it will keep motivating me to write consistently every week without burning out.

What healthy habits would you like to add to your life?

Categories
Data Science Leadership

Build a Data Science Team from Scratch

Is your organization beginning its analytics or AI journey? One of the first steps will be building a data science team to help develop and deploy machine learning models. I was lucky to have the opportunity to be the first data scientist at a tech startup. In this article, I will share how we built a data science team from scratch.

Leadership Buy-in

The importance of executive sponsorship for your data science efforts cannot be overstated.

In my current role (not the tech startup), predictive modeling is core to our company’s value proposition. As a result, we have a large data science team relative to the size of the company and access to great compute resources.

But at the tech startup, predictive modeling was viewed by many as a nice-to-have. I made business cases to product managers for why they should add predictive components to their products. Most were interested, but early on none could find time on the roadmap given other priorities.

As the company and its product offerings expanded, new senior leaders eventually joined and key existing leaders began to believe data science was part of the critical path. From there, we finally got models into production and grew the team.

If your organization doesn’t have an executive sponsor, be persistent in building your business case but expect an uphill battle early on.

Infrastructure Prerequisites

Before data scientists can be productive, you need 1) data and 2) data engineers that can make it accessible. (Read this if you’re not sure about the difference between a data engineer and a data scientist.)

When I joined as the first data scientist, data was locked in a transactional database with prohibitively slow query times. An engineering team was in the middle of building a Spark-based data platform, but it was designed for customer reporting, not data science. So to get access to the data and do my work, I sat next to the data architect and learned Scala.

It wasn’t until the business intelligence team built a data warehouse in Snowflake that life as a data scientist got easier. None of my models made it to production until this data warehouse was built. Something worth noting here: the business intelligence team doubled in size to build and maintain the data warehouse.

Org Structure

There’s no easy answer to the question of where in your organization a data science team should live. Expect to go through some trial and error with your data science org structure in the early days. I personally had 7 different bosses in 2.5 years and went through 3 re-orgs.

Nonetheless, here are some popular org structures for data science.

  • In parallel with software engineers
    • If your data science team is building models for production, customer-facing systems, it probably should report into the CTO.
    • Data science is similar to DevOps in that it is separate from but in support of software engineering.
    • The languages and tools for data science are different from those of software engineering, so there needs to be some separation. But not too much, because software engineering will consume model outputs. Coordination will need to be tight.
  • Center of excellence
    • It’s common for data science teams to be at least partially centralized, with most team members reporting into a head of data science.
    • It’s important to have at least some level of centralization so data scientists can share knowledge with one another, do peer reviews, and collaborate during the creative stages of model development.
  • Embedded with product teams
    • Early on, I sat with a product team mixed with engineers, designers, and product managers. This approach failed for me because I was forced to use software engineering tools to deploy models, rather than the data science tools I was comfortable with. I decided to leave that team and join the business intelligence team to get more of the center-of-excellence benefits described above.
  • Internal modeling
    • If your data science team is building models for internal users (more of a consulting, business intelligence-type function but with machine learning), then it might not make sense for the data science team to report into engineering. But as the team scales and use cases expand into customer-facing products, engineering becomes a more natural place to live.

Team Composition

While there’s no prescription for optimum data science team composition, you should be aware of the many backgrounds and specialties within data science (and the many similar roles often confused with data science).

  • Researcher
    • If your product or solution approaches the known boundaries of machine learning, you may want to stack your team with PhDs.
    • Examples of unsolved problems that require deep research:
  • Engineer
    • Most predictive modeling use cases do not exist on the boundaries of knowledge. They involve applying known algorithms to new contexts. Chances are, your use case falls into this category.
    • Build a team with solid foundations in statistics and programming. They will move quickly to find statistically-sound solutions using known approaches, and be capable of deploying to production.
  • Business
    • Users and product managers can’t always anticipate what is possible when machine learning is applied to your data. Have enough team members capable of working closely with the product team to assist with ideation, feasibility studies, and rapid prototypes when needed.
    • Data visualization and business communication are also crucial. Ensure you have enough team members who can tell a compelling story with data.
  • Specialties
    • Depending on your machine learning task and the type of input data you have, you may need to hire someone who specializes in one of the following.
      • Natural Language Processing (NLP) – text data
      • Computer Vision – image or video data
      • Deep Learning – massive, unstructured data sets
      • Personalization – product recommendations
      • Reinforcement Learning – robotics, navigation, finance
      • Fraud and Anomaly Detection – finance, web security, IoT
      • Time Series – finance, event data

Recruiting

Data science has consistently ranked as one of the best jobs in the US for many years. There’s high demand for data scientists and a stubborn shortage driven by the difficulty of acquiring all the requisite skills.

In my experience building a team from 1 to 6, you should expect the recruiting process to take anywhere from 3-6 months for each team member. The more senior the role, the longer it will take to fill.

It’s also wise to build the team incrementally vs hiring all at once. Your first hire may tell you that the data infrastructure isn’t ready yet and that you’re better off hiring more data engineers.

Whatever you do, do not hire a “Researcher” type as your first data scientist. You probably want a “Business” type data scientist who can work closely with the product team, effectively communicate challenges and progress, and evangelize data science in your org. The first 2 data scientists on our team (myself and another) both had MBAs. That business sense made all the difference in the early days.

Be Patient

Data science is inherently experimental. Even once you assemble your dream team, it may take many months (6+) before they discover and deploy a solution to your business problem.

The timeline from zero (no buy-in, no data infrastructure, no data science org structure, no data science team) to production can be extremely long. Depending on the size of your org and the maturity of its relationship with data, it could be 1.5 to 2 years.

One trick to help your team deliver value sooner, and maintain buy-in from your organization, is to ask them to knock out some quick wins. That might mean deploying a baseline model, or solving a more narrowly-scoped version of the problem.

But at the end of the day, the road to building a data science team from scratch is long, winding, and expensive. This is why AI vendors are so prolific and valuable. The build-vs-buy decision for AI products is a tough one, but I hope this article gives you a bit more of the information you need to make it.

Categories
DAGs Data Engineering

How to Improve a DAG

This post is part of a collaboration between Alisa Aylward of Alisa in Techland and Jared Rand of Skillenai. View Jared’s post on Alisa in Techland here.


Discover the worst data pipeline ever and how to improve its DAG. Learn to remove cycles and handle dependencies efficiently.

What is a DAG?

DAG stands for directed acyclic graph. In the sphere of data engineering, you will often hear DAG thrown around as though it is synonymous with a data pipeline; it is actually a more general mathematical concept. Data pipelines fit the definition of DAG, but not all DAGs are data pipelines. Data pipelines are DAGs because they are:

  • Directed, because there is a defined, non-arbitrary order of tasks
  • Acyclic, because they never form cycles, since a cycle would mean the pipeline runs forever
  • Graph, because their visual representation consists of nodes and edges

A non-math example of a DAG is a family tree (Wikipedia). So DAGs exist all around us, but what makes them good or bad in the world of data pipelining?

A note on terminology: mathematically, a DAG consists of nodes (the circles) and edges (the straight lines), but in this post, I will refer to the nodes as “tasks”. This is what they are called in Apache Airflow, and it more accurately describes their function as a “unit of work within a DAG” (Source).

What is a bad DAG?

There is no definition for what makes a bad DAG, but there are attributes of a DAG that can cause inefficiencies. Jared designed the “worst data pipeline ever” to display some of the common pipeline design mistakes:


Image credit – Worst data pipeline ever

How do we improve a bad DAG?

Cycles

A DAG can be so bad that it is not even a DAG; in this case, the data pipeline has cycles so it is not acyclic. The cycles in this pipeline are highlighted below:

Image by author – Worst data pipeline, with cycles

If this pipeline were coded in an Airflow DAG file, the Airflow webserver would neither render it visually nor run it. It would display the error: “Cycle detected in DAG”.
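
To make this concrete, here is a minimal sketch of an Airflow DAG file that recreates the new_visitors -> past_visitors loop. This is my own illustration (assuming Airflow 2.x and placeholder EmptyOperator tasks), not the actual pipeline code:

from datetime import datetime

from airflow import DAG
from airflow.operators.empty import EmptyOperator  # use DummyOperator on older Airflow versions

with DAG(
    dag_id="cycle_demo",
    start_date=datetime(2021, 1, 1),
    schedule_interval="@daily",
) as dag:
    new_visitors = EmptyOperator(task_id="new_visitors")
    past_visitors = EmptyOperator(task_id="past_visitors")

    # The second dependency closes a loop, so Airflow's integrity check
    # rejects the DAG when the file is parsed and reports "Cycle detected in DAG".
    new_visitors >> past_visitors
    past_visitors >> new_visitors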

It is hard to say how to fix these cycles without seeing the data, but the past_acquisition -> past_acquisition flow should be combined into one task. The one thing I would caution against is solving this by copying the past_acquisition task twice. That cycle has to be eliminated by refactoring the code to handle the update within a single task.

The same is true for the new_visitors -> past_visitors -> new_visitors cycle. The cycles have to be eliminated within the code.

Dependencies

Although I don’t know the data associated with this DAG, it looks like the underlying architectural issue here is one of database design. Specifically, views, visitors, and campaigns are their own entities and should have some sort of physical manifestation before code is built on top of them to produce the reporting table acquisition_today. Assuming that the acquisition_today report was the one requested by stakeholders, it is natural to want to build the DAG around it. After all, what use is a DAG if it doesn’t produce something that the stakeholders want?

We do want to produce the report, but we want to do it as efficiently and scalably as possible. The inefficiency of this DAG is easiest to explain with a hypothetical situation: if there is an error in views_today and we fix it, then in order to update the report, we have to re-run all downstream tasks. This impacts all but three tasks: views_raw, visitors_raw, and campaigns_raw. This is time-consuming (as stakeholders wait anxiously for their data) and compute intensive. If past_visitors is mostly visitor-based data with one or two fields from views, re-running past_visitors because of a views change is inefficient.

Additionally, not having a visits or campaign entity makes it hard to debug changes to the acquisition report. If a stakeholder notices the report has a sharp increase in visits, can they quickly isolate those visits for quality assurance? If the report shows 30 visits yesterday and 100 today, can they easily find the new visit_ids to look them up in the source system? When all the building blocks of a report are materialized for stakeholders, they have more agency to both understand and debug the data, taking work off the data engineering team.

Lastly, history tells us that stakeholders may want more than one report from this data, or changes to existing reports. If stakeholders request a funnel_report that uses the views data but not the campaign data, this DAG does not give us the flexibility to build it. Therefore, we want to build out all the underlying entities and then combine them at the very end.

What makes a good DAG?

Below is a good DAG. Notice that:

  • There are no cycles
  • All entities are materialized independently (campaigns, visitors, views, etc) before being combined into the acquisition_report
  • We are able to add several reports (funnel_report, customer_profile) from the same entities

Image credit – A better data pipeline
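
For reference, here is a rough sketch of how the improved dependencies might be declared in an Airflow DAG file. The task names follow the diagram, but the operators, schedule, and report inputs are placeholders of my own:

from datetime import datetime

from airflow import DAG
from airflow.operators.empty import EmptyOperator

with DAG(
    dag_id="acquisition_pipeline",
    start_date=datetime(2021, 1, 1),
    schedule_interval="@daily",
) as dag:
    # Raw extracts
    views_raw = EmptyOperator(task_id="views_raw")
    visitors_raw = EmptyOperator(task_id="visitors_raw")
    campaigns_raw = EmptyOperator(task_id="campaigns_raw")

    # Entities materialized independently
    views = EmptyOperator(task_id="views")
    visitors = EmptyOperator(task_id="visitors")
    campaigns = EmptyOperator(task_id="campaigns")

    # Reports combined only at the very end
    acquisition_report = EmptyOperator(task_id="acquisition_report")
    funnel_report = EmptyOperator(task_id="funnel_report")
    customer_profile = EmptyOperator(task_id="customer_profile")

    views_raw >> views
    visitors_raw >> visitors
    campaigns_raw >> campaigns

    [views, visitors, campaigns] >> acquisition_report
    [views, visitors] >> funnel_report   # no campaign dependency needed
    visitors >> customer_profile         # illustrative input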

Categories
Coding Data Science Python

Singleton Fails with Multiprocessing in Python

A singleton is a class designed to permit only a single instance. Singletons have a bad reputation, but they do have (limited) valid uses. They present lots of headaches and may throw errors when used with multiprocessing in Python. This article will explain why, and what you can do to work around it.

A Singleton in the Wild

Singleton usage is exceedingly rare in Python. I’ve been writing Python code for 5 years and had never come across one until last week. Having never studied the singleton design pattern, I was perplexed by the convoluted logic. I was also frustrated by the errors that kept popping up when I was forced to use it.

The most frustrating aspect of using a singleton for me came when I tried to run some code in parallel with joblib. Inside the parallel processes, the singleton always acted like it hadn’t been instantiated yet. My parallel code only worked when I added another instantiation of the singleton inside the function called by the process. It took me a long time to figure out why.

Why Singletons Fail with Multiprocessing

The best explanation for why singletons throw errors with multiprocessing in Python is this answer from StackOverflow.

Each of your child processes runs its own instance of the Python interpreter, hence the singleton in one process doesn’t share its state with those in another process.

https://stackoverflow.com/questions/45077043/make-singleton-class-in-multiprocessing

Your singleton instance won’t be shared across processes.

Working Around Singleton Errors in Multiprocessing

There are several ways to work around this problem. Let’s start with a basic singleton class and see how a simple parallel process will fail.

import time
from joblib import Parallel, delayed


class OnlyOne:
    """Singleton Class, inspired by
    https://python-3-patterns-idioms-test.readthedocs.io/en/latest/Singleton.html"""

    class __OnlyOne:
        def __init__(self, arg):
            if arg is None:
                raise ValueError("Pretend empty instantiation breaks code")
            self.val = arg

        def __str__(self):
            return repr(self) + self.val

    instance = None

    def __init__(self, arg=None):
        if not self.instance:
            self.instance = self.__OnlyOne(arg)
        else:
            self.instance.val = arg

    def __getattr__(self, name):
        return getattr(self.instance, name)


def worker(num):
    """Single worker function to run in parallel.
    Assume that this function has to do an empty
    instantiation of the singleton.
    """
    one = OnlyOne()
    time.sleep(0.1)
    one.val += num
    return one.val


# Instantiate singleton
one = OnlyOne(0)
print(one.val)

# Try to run in parallel
# Will hit the ValueError that raises with
# empty instantiation
res = Parallel(n_jobs=-1, verbose=10)(
    delayed(worker)(i) for i in range(10)
)
print(res)

In this example, the worker function has to do an empty instantiation of the singleton because we want access to an attribute stored in the singleton. We don’t know what value to instantiate it with, because that value is the very thing we’re trying to read from the attribute.

Environment Variables

Here’s a simple solution I came up with that worked for me, and might work for you as well. It uses environment variables to store state across processes.

import time
from joblib import Parallel, delayed
import os


class OnlyOne:
    """Singleton Class, inspired by
    https://python-3-patterns-idioms-test.readthedocs.io/en/latest/Singleton.html
    Modified to work with parallel processes using environment
    variables to store state across processes.
    """

    class __OnlyOne:
        def __init__(self, arg):
            if arg is None:
                raise ValueError("Pretend empty instantiation breaks code")
            self.val = arg

        def __str__(self):
            return repr(self) + self.val

    instance = None

    def __init__(self, arg=None):
        if not self.instance:
            if arg is None:
                # look up val from env var (env vars are strings,
                # so cast back to int for this example)
                arg = int(os.getenv('SINGLETON_VAL'))
            else:
                # set env var so all workers use the same val
                # (env var values must be strings)
                os.environ['SINGLETON_VAL'] = str(arg)
            self.instance = self.__OnlyOne(arg)
        else:
            self.instance.val = arg

    def __getattr__(self, name):
        return getattr(self.instance, name)


def worker(num):
    """Single worker function to run in parallel.
    Assume that this function has to do an empty
    instantiation of the singleton.
    """
    one = OnlyOne()
    time.sleep(0.1)
    one.val += num
    return one.val


# Instantiate singleton
one = OnlyOne(0)
print(one.val)

# Run in parallel worry-free
res = Parallel(n_jobs=-1, verbose=10)(
    delayed(worker)(i) for i in range(10)
)
print(res)

Pass Singleton as Argument

Another solution is to simply pass the instantiated singleton instance as an argument to the worker function.

import time
from joblib import Parallel, delayed


class OnlyOne:
    """Singleton Class, inspired by
    https://python-3-patterns-idioms-test.readthedocs.io/en/latest/Singleton.html"""

    class __OnlyOne:
        def __init__(self, arg):
            if arg is None:
                raise ValueError("Pretend empty instantiation breaks code")
            self.val = arg

        def __str__(self):
            return repr(self) + self.val

    instance = None

    def __init__(self, arg=None):
        if not self.instance:
            self.instance = self.__OnlyOne(arg)
        else:
            self.instance.val = arg

    def __getattr__(self, name):
        return getattr(self.instance, name)


def worker(num, one):
    """Single worker function to run in parallel."""
    time.sleep(0.1)
    one.val += num
    return one.val


# Instantiate singleton
one = OnlyOne(0)
print(one.val)

# Run in parallel succeeds when one is passed
# as arg to worker
res = Parallel(n_jobs=-1, verbose=10)(
    delayed(worker)(i, one) for i in range(10)
)
print(res)

Categories
Blog Rankings

Designing a Blog Ranking System

Every year a slew of top blog lists are published for every topic under the sun (think “Top 10 Blogs in Data Science for 2020”). Most are manually curated, many are biased by the business interests of the publisher, few are data-driven, and none give you a voice. This article is my plan for solving these problems by delivering an unbiased, data-driven, and crowd-sourced blog ranking system for the data science community and beyond.

System Overview

Here’s how the system will work.

  1. Anyone submits a blog for consideration.
  2. Editorial team reviews submissions for quality.
  3. Site metrics, such as domain authority and Alexa rank, are collected via API data pulls.
  4. Users vote for their favorite blogs on the rankings page.
  5. Ranking algorithm considers votes, site metrics, and editorial factors (a rough sketch of what this could look like follows this list).
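
Since the algorithm itself is still on the drawing board, here is only a rough, illustrative Python sketch of what a weighted score could look like. The weights, field names, and normalization are placeholders, not the final formula:

def rank_score(blog, w_votes=0.5, w_da=0.3, w_alexa=0.1, w_editorial=0.1):
    """Illustrative weighted blog score; weights are placeholders.
    In practice each metric would be normalized to a comparable scale first."""
    alexa_component = 1.0 / (1.0 + blog["alexa_rank"])  # lower Alexa rank is better
    return (w_votes * blog["votes"]
            + w_da * blog["domain_authority"]
            + w_alexa * alexa_component
            + w_editorial * blog["editorial_boost"])

# Hypothetical submissions
blogs = [
    {"name": "Blog A", "votes": 120, "domain_authority": 45,
     "alexa_rank": 80_000, "editorial_boost": 1},
    {"name": "Blog B", "votes": 40, "domain_authority": 70,
     "alexa_rank": 20_000, "editorial_boost": 0},
]
for blog in sorted(blogs, key=rank_score, reverse=True):
    print(blog["name"], round(rank_score(blog), 2))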

It’s pretty simple in theory. But how do we ensure enough votes are collected? And why is it useful for Skillenai to build and maintain this ranking system?

Count All the Votes

Collecting enough votes should be easy. This system will be self-reinforcing. Here’s how.

  1. A blog owner submits their blog for consideration.
    • Get new readers.
    • Build brand / personal reputation.
    • Easy backlink to their site.
  2. Blog owner encourages fans to upvote their blog.
    • Want to rank higher and get more traffic.
    • Blogger shares on social media.
    • Blogger links to rankings page.
  3. Other bloggers discover rankings page.
    • Thanks to promotion efforts of previous bloggers.
    • Cycle starts over.

And if that’s not enough, votes will also come from users of Skillenai’s products. Which brings me to the answer to the other major question, why it’s useful for Skillenai to maintain these rankings.

What’s In It For Skillenai

Skillenai’s core product, the Career Wizard, recommends learning resources based on skills and goals. Users of this product can upvote (and have other types of engagement with) various resources, including entire blogs. Each of those votes will be tied to their skills profile. This data enables powerful personalization of recommended learning resources.

These votes also make the blog rankings much more interesting. Rankings can be filtered on particular skills to help users discover the most relevant resources for them.

Categories
Coding Data Science Python

Popular Python Packages for Data Science

Python has a robust ecosystem of data science packages. In this article, I’ll discuss the most popular Python packages for data science, including the essentials as well as my favorite packages for visualization, natural language processing, and deep learning.

Essential Python Packages for Data Science

The vast majority of data science workflows utilize these four essential Python packages.

I recommend using Anaconda to set up your Python environment. One of its many benefits is that it automatically installs these four libraries, along with many other essential Python packages for data science.

Numpy

The fundamental package for scientific computing with Python.

https://numpy.org/

Numpy is foundational for data science workflows because of its efficient vector operations. The Numpy ndarray is a workhorse for mathematical computations in tons of useful libraries.
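
For example, a few of those vectorized operations look like this:

import numpy as np

a = np.array([1.0, 2.0, 3.0, 4.0])
b = np.arange(4)                     # array([0, 1, 2, 3])

print(a + b)                         # element-wise addition: [1. 3. 5. 7.]
print(a * 2)                         # broadcasting a scalar: [2. 4. 6. 8.]
print(a.mean(), a.std())             # fast aggregate statistics
print(a.reshape(2, 2) @ np.eye(2))   # matrix multiplication on a 2x2 reshape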

Pandas

pandas is a fast, powerful, flexible and easy to use open source data analysis and manipulation tool, built on top of the Python programming language.

https://pandas.pydata.org/

The Pandas dataframe is the primary data object for most data science workflows. A dataframe is basically a database table, with named columns of different data types. Numpy ndarrays, in contrast, must have the same data type for each element.
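
Here’s a quick example of a dataframe holding mixed column types, something a single ndarray can’t do:

import pandas as pd

df = pd.DataFrame({
    "user": ["alice", "bob", "carol"],                     # strings
    "visits": [12, 3, 7],                                   # integers
    "signup_date": pd.to_datetime(
        ["2020-01-05", "2020-02-11", "2020-03-20"]),        # datetimes
})

print(df.dtypes)                                            # one dtype per column
print(df[df["visits"] > 5])                                 # filter rows like a WHERE clause
print(df.groupby(df["signup_date"].dt.month)["visits"].sum())  # aggregate like a GROUP BY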

Scikit Learn

Simple and efficient tools for predictive data analysis

https://scikit-learn.org/stable/

Scikit Learn is the workhorse for machine learning pipelines. It’s built on top of Numpy and SciPy, and plays nice with Pandas and Matplotlib. Scikit Learn offers implementations of almost every popular machine learning algorithm, including logistic regression, random forest, support vector machines, k-means, and many more.
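
Training one of those algorithms takes only a few lines. Here’s a quick example using a random forest on the built-in iris dataset:

from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Load a small toy dataset and hold out a test set
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)

# Fit and evaluate a classifier
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)
print(accuracy_score(y_test, model.predict(X_test)))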

Matplotlib

Matplotlib is a comprehensive library for creating static, animated, and interactive visualizations in Python.

https://matplotlib.org/

Matplotlib is the foundational data visualization package for Python. Pandas and Scikit Learn both have handy visualization modules that rely on Matplotlib. It’s a very flexible and intuitive plotting library. The downside is that creating complex and aesthetically pleasing plots usually requires many lines of code.
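
A minimal plot takes just a few lines; it’s the polishing that adds up:

import matplotlib.pyplot as plt
import numpy as np

x = np.linspace(0, 10, 100)

fig, ax = plt.subplots(figsize=(6, 4))
ax.plot(x, np.sin(x), label="sin(x)")
ax.plot(x, np.cos(x), label="cos(x)", linestyle="--")
ax.set_xlabel("x")
ax.set_ylabel("value")
ax.set_title("A basic Matplotlib line plot")
ax.legend()
plt.show()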

Visualization Packages

My two favorite Python visualization packages for data science are both built on top of Matplotlib.

Seaborn

Seaborn is a Python data visualization library based on matplotlib. It provides a high-level interface for drawing attractive and informative statistical graphics.

https://seaborn.pydata.org/

Seaborn makes beautiful statistical plots with one line of code. Here are a few of my favorite examples.

A violin plot made with the Seaborn data visualization Python package can help you visualize the distribution of a variable for various slices. The violin plot displays a box and whisker plot along with a kernel density estimate of the distribution.
A joint plot made with the Seaborn data visualization Python package can help you visualize the joint distribution of two variables. The joint plot is similar to a scatter plot, but shows the density of values instead of individual observations. It also shows the univariate distributions on each axis.
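
Both plots really are one-liners. Here’s a small sketch using Seaborn’s bundled tips example dataset (load_dataset fetches it over the internet on first use):

import matplotlib.pyplot as plt
import seaborn as sns

tips = sns.load_dataset("tips")  # small example dataset from Seaborn's data repository

# Violin plot: distribution of total bill for each day
sns.violinplot(x="day", y="total_bill", data=tips)
plt.show()

# Joint plot: joint density of two variables plus their marginal distributions
sns.jointplot(x="total_bill", y="tip", data=tips, kind="kde")
plt.show()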

Scikit Plot

There are a number of visualizations that frequently pop up in machine learning. Scikit-plot generates quick and beautiful graphs and plots with as little boilerplate as possible.

https://scikit-plot.readthedocs.io/en/stable/

Scikit Plot provides one-liners for many common machine learning plots, including confusion matrix heatmaps, ROC curves, precision-recall curves, lift curves, cumulative gains charts, and others. Here’s a slideshow of examples.
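
As a small sketch, here’s the confusion matrix one-liner, assuming you already have true labels and predictions from a fitted classifier (the package is installed as scikit-plot and imported as scikitplot):

import matplotlib.pyplot as plt
import scikitplot as skplt

# Placeholder labels and predictions; in practice these would come
# from a held-out test set and a fitted model's predict() output.
y_test = [0, 0, 1, 1, 2, 2, 2, 1]
y_pred = [0, 1, 1, 1, 2, 2, 0, 1]

skplt.metrics.plot_confusion_matrix(y_test, y_pred, normalize=True)
plt.show()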

Natural Language Processing Packages

Natural language processing (NLP) is my specialty within data science. There’s a lot you can accomplish with Scikit Learn for NLP, which I’ll quickly mention below, but there are two additional libraries that can really help you level-up your NLP project.

Scikit Learn

CountVectorizer or TfidfVectorizer make it easy to transform a corpus of documents into a term document matrix. You can train a bag of words classification model in no time when you combine these with LogisticRegression.
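
Here’s a rough sketch of that bag-of-words setup on a tiny made-up corpus; a real project would of course use far more data and a proper train/test split:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Tiny made-up corpus: 1 = positive review, 0 = negative review
docs = [
    "loved this product",
    "great value and fast shipping",
    "terrible quality",
    "would not buy again",
]
labels = [1, 1, 0, 0]

# Vectorize the text and fit a classifier in one pipeline
model = make_pipeline(TfidfVectorizer(), LogisticRegression())
model.fit(docs, labels)
print(model.predict(["fast shipping and great quality"]))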

Scikit Learn also provides LatentDirichletAllocation for topic modeling with LDA. I like to pair it with pyLDAvis to produce interactive topic modeling dashboards like the one below.

Screenshot of an interactive topic modeling dashboard generated by pyLDAvis.

Spacy

Industrial strength natural language processing

https://spacy.io/

Spacy is really powerful, and in my opinion supersedes the NLTK package that used to be the gold standard for things like part of speech tagging, dependency parsing, and named entity recognition.

Spacy does all of those for you in one line of code without any NLP knowledge. And it’s extensible, so you can add your own entities and metadata to spans of tokens within each document.

And the displacy visualization tool is just awesome. Check out this slideshow of examples.
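
Here’s a minimal sketch of that one-line feel; it assumes you’ve installed Spacy and downloaded the small English model (python -m spacy download en_core_web_sm):

import spacy
from spacy import displacy

nlp = spacy.load("en_core_web_sm")  # small English pipeline, downloaded separately
doc = nlp("Apple is looking at buying a U.K. startup for $1 billion.")

# Part-of-speech tags and named entities come for free
print([(token.text, token.pos_) for token in doc])
print([(ent.text, ent.label_) for ent in doc.ents])

# displacy renders entity (or dependency) visualizations;
# in a Jupyter notebook this displays inline, otherwise it returns HTML markup
html = displacy.render(doc, style="ent")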

Transformers

🤗 Transformers provides general-purpose architectures (BERT, GPT-2, RoBERTa, XLM, DistilBert, XLNet…) for Natural Language Understanding (NLU) and Natural Language Generation (NLG) with over 32+ pretrained models in 100+ languages and deep interoperability between TensorFlow 2.0 and PyTorch.

https://huggingface.co/transformers/

State-of-the-art language models such as BERT and GPT-3 are trained with a neural network architecture called “transformers.” The Transformers library by Hugging Face allows you to apply these pre-trained models to your text data.

By doing so, you can generate vectors for each sentence in your corpus and use those vectors as features in your models. Or if your task is one that Transformers supports, you can just apply a complete model and be done. They currently support the following tasks.

  • Sequence classification
  • Extractive question answering
  • Language modeling
  • Named entity recognition
  • Summarization
  • Translation
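
As a quick sketch, the high-level pipeline API handles one of those tasks (sequence classification here) in a couple of lines; the first call downloads a default pretrained model, so it needs an internet connection:

from transformers import pipeline

# Downloads a default pretrained sentiment model on first use
classifier = pipeline("sentiment-analysis")
print(classifier("Transformers makes state-of-the-art NLP surprisingly accessible."))
# e.g. [{'label': 'POSITIVE', 'score': 0.99...}]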

Deep Learning Packages

Deep learning in Python is dominated by two packages: TensorFlow and PyTorch. There are others, but most data scientists use one of these two, and the split seems roughly equal. So if you want to train a neural network, I recommend picking TensorFlow or PyTorch.

TensorFlow

An end-to-end open source machine learning platform

https://www.tensorflow.org/

I use TensorFlow for all of my deep learning needs. The Keras high-level wrapper, which is now incorporated into TensorFlow 2.0, was what sold me on TensorFlow. It’s intuitive, reasonably flexible, and efficient.
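
Here’s a small sketch of that Keras workflow: a simple feed-forward classifier on the MNIST digits (downloaded on first run). The layer sizes and epoch count are arbitrary:

import tensorflow as tf

# Load and scale the MNIST digits
(x_train, y_train), (x_test, y_test) = tf.keras.datasets.mnist.load_data()
x_train, x_test = x_train / 255.0, x_test / 255.0

# Define a small feed-forward network with the Keras Sequential API
model = tf.keras.Sequential([
    tf.keras.layers.Flatten(input_shape=(28, 28)),
    tf.keras.layers.Dense(128, activation="relu"),
    tf.keras.layers.Dropout(0.2),
    tf.keras.layers.Dense(10, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])

model.fit(x_train, y_train, epochs=2, validation_split=0.1)
print(model.evaluate(x_test, y_test))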

TensorFlow is production ready with the TensorFlow Serving module. I’ve used it in combination with AWS SageMaker to deploy a neural network behind an API, but I wouldn’t describe TensorFlow Serving as particularly easy to use.

PyTorch

An open source machine learning framework that accelerates the path from research prototyping to production deployment

https://pytorch.org/

I’ve never personally used PyTorch (except as a backend for the Transformers library), so I probably won’t do it justice here. I’ve heard it described as more popular in academia. That quote above from their website suggests they are trying to change that image.

But my understanding is that PyTorch offers a bit more flexibility for designing novel neural network architectures than does TensorFlow. So if you plan to do research on new architectures, PyTorch might be right for you.

Categories
Career Data Science

What Does a Data Scientist Do?

Have you been wondering what a data scientist does? This article will clarify what a data scientist does and does not do. We’ll show why data science is such a great career, and how it compares to similar roles.

Why Data Science Is a Good Career

Data science is one of the best careers for highly technical individuals. It offers a clear path to a salary above $100,000 within 5 years. It also offers a chance to solve challenging problems, work with multi-disciplinary teams, and make a big impact.

Similar Roles That Get Confused with Data Science

The skills required of a data scientist overlap with many other roles. But there are key differences between data science and these roles in both emphasis and level of expertise required for each skill.

Here are some similar roles that often get conflated with data science.

  • Business Intelligence Analyst
    • Similarities: Both roles require analyzing data and building predictive models.
    • Differences: BI analysts are more focused on visualization and dashboards. They are also more consultative, often working with internal stakeholders to deliver insights. In contrast, data scientists are more focused on building models for production use cases (i.e. for external customers).
  • Data Engineer
    • Similarities: Both roles require building extract, transform, load (ETL) pipelines. They also both require knowledge of databases and big data.
    • Differences: Data engineers are exclusively responsible for building and maintaining data pipelines and data stores. Data scientists, in contrast, may build their own data pipelines but only as a means to an end for their model pipelines.
  • Machine Learning Engineer
    • Similarities: Both roles require knowledge of machine learning software, along with tools for deploying models to production.
    • Differences: ML engineers are exclusively responsible for deploying and maintaining production models and pipelines. Data scientists are often expected to deploy their own models as well, at least in low-volume contexts. But for larger scale products or companies, data scientists may focus on research and development while ML engineers focus on production.
  • Software Engineer
    • Similarities: Both roles require knowledge of coding best practices.
    • Differences: Software engineers are experts in software, typically working at a lower level and with deeper conceptual knowledge of software than data scientists. But coding best practices are critical for every data scientist.
  • Research Scientist
    • Similarities: Both roles require scientific rigor and data analysis.
    • Differences: Research scientists generally perform research that pushes boundaries. Their work typically results in either publications or application of cutting-edge research to industry problems. Data scientists are focused on solving the highest-value business problems, which may or may not require novel approaches.
  • Data Analyst / Business Analyst
    • Similarities: Both apply analytical techniques to understand trends in data.
    • Differences: Data analysts generally perform descriptive analyses that summarize historical data. Data scientists generally perform predictive analyses that forecast future outcomes or predict behavior in never-before-seen circumstances.

What Data Scientists Don’t Do

Before we dive into what data scientists do, let’s talk about what data scientists don’t do.

  • SQL all day
    • SQL is a powerful language for extracting and analyzing data stored in relational databases. Every data scientist should be proficient in SQL. But if you’re writing SQL all day, you’re probably either a Data Engineer (building data pipelines) or a Business Analyst (pulling data for descriptive analysis).
  • Excel all day
    • Excel is powerful, too, and certainly has its place in some data science workflows. But Excel is terrible for automation and reproducibility. It also lacks machine learning capabilities, to the best of my knowledge. If you’re serious about entering the field of data science, start learning Python or R right away.
  • Prototyping all day
    • Prototypes are invaluable for proving a concept and getting buy-in for a new feature or product. But data scientists must inevitably write production code and deploy models to production. Contrary to the beliefs of some software engineers, not all data science code is prototype-quality.
  • Academic research
    • Some data science roles have research components and lead to publications. And most data scientists must spend a non-trivial proportion of time experimenting with approaches that may ultimately lead to dead ends. But data science is not an academic exercise; at the end of the day, the goal must be to deploy models to production (and to maintain until end of life).
  • Web development
    • It’s often useful for data scientists to be familiar with web architectures and frameworks. And in the prototyping phase, it’s not unusual for a data scientist to create a simple web app. But building websites or apps is not a core responsibility of most data scientists.

What Data Scientists Do

Now that we know what a data scientist doesn’t do, let’s finally answer the titular question: what does a data scientist do? There are two famous definitions of data science that I’ll introduce first.

The Tweet

The famous data science definition by Josh Wills

The Venn Diagram

The famous data science skill set Venn diagram

The Gist

The gist of both of these definitions is this: data science is a hybrid field combining elements of statistics, software, and business. It’s inherently multi-disciplinary.

It’s worth noting something that may be obvious. It’s difficult for individuals to be proficient in all three disciplines. And it’s valuable for a company to have one person capable of doing work in all three. This is what drives low supply and high demand for data scientists in the labor market, contributing to a stubborn shortage that has persisted for years.

The Details

Now let’s drill into more details about what, specifically, data scientists do.

Data Science Foundations

If I were to teach a data science bootcamp, I would cover the following foundational topics. (Similarly, if I were a student, I would look for a program with the following topics in the curriculum.)

  • Software Engineering
    • Python
      • Basics, numpy, pandas, scikit-learn, matplotlib
      • Coding standards – PEP 8, OOP, abstraction, modularity
      • Parallelization, profiling
    • Cloud
      • AWS – S3, EC2, SageMaker, Lambda
    • Deployment
      • Flask, SageMaker
      • Python packaging, dependency management
      • Testing
    • SQL
      • SELECT, JOIN, GROUP BY, WINDOW
    • Unix
      • Navigation, file manipulation, cron, screen, git
    • Git
      • Pull request workflow, code reviews
    • Development
      • Jupyter, IDEs – Atom, vscode
      • Repo templates
  • Modeling
    • Regression
      • Linear regression
    • Classification
      • Logistic regression
    • Clustering
      • K-means
    • Ensembling
      • Random forest, Gradient boosted trees
    • Other algorithms
      • Naïve Bayes
      • SVM
      • Collaborative Filtering
      • Multi-layer perceptron
    • Visualization
      • Matplotlib, seaborn
    • Big Data
      • Parallelization with joblib and multiprocessing
      • Memory limits
      • Spark
    • Model Selection
      • Pipelines
      • Hyperparameter tuning
      • Cross validation
    • Exploratory Data Analysis
    • Statistics
    • Hypothesis Testing
    • Model Evaluation
      • Metrics
      • Experimental design
    • Data quality checks
  • Business / Domain
    • Feasibility Studies
    • Choosing a target
    • Prototyping
    • Acquiring input data
    • Annotating data
    • Data Dictionaries
    • Agile
    • Assessing level of effort
    • Business communication
      • Your responsibility to make people understand
      • Executive summaries
      • Briefs
      • Setting expectations
    • Relationship with subject matter experts
    • Relationship with product managers
    • Relationship with customers / clients
    • Relationship with software engineers
    • Balancing short-term business value and research
      • Breakthroughs vs incremental progress
    • Helping business team understand uncertainty and non-linearity
      • Success not guaranteed, may be dependent on breakthroughs
    • Balancing interpretability and accuracy
    • Defining business metrics
    • Levels of maturity
      • Modeling efforts progress through different levels of maturity; each level needs to be managed accordingly

Advanced Topics

Finally, let’s just mention some of the advanced topics in the field of data science that you may have heard as buzz words. Each of these topics could be considered a specialization within data science, to be mastered once you’re proficient in all of the foundations. I’ll leave descriptions of each for future posts.

  • Deep Learning
  • Natural Language Processing (NLP)
  • Computer Vision
  • ML Engineering
  • Big Data
  • Time Series
  • Fraud and Anomaly Detection
  • Personalization
  • Reinforcement Learning
  • Data Science Leadership

Become a Data Scientist

So those are the things a data scientist does every day. No one said it was easy. But I will say that it’s a ton of fun, and an extremely rewarding career.

There’s always more to learn, and that’s why Skillenai was born – to help every data scientist at every level continue growing their career.

Stay tuned as we add tutorials for all of the topics mentioned here and beyond.