A singleton is a class designed to permit only a single instance. Singletons have a bad reputation, but they do have (limited) valid uses. They also cause plenty of headaches, and may throw errors when used with multiprocessing in Python. This article explains why, and what you can do to work around it.
A Singleton in the Wild
Singleton usage is exceedingly rare in Python. I’ve been writing Python code for five years and had never come across one until last week. Having never studied the singleton design pattern, I was perplexed by its convoluted logic. I was also frustrated by the errors that kept popping up when I was forced to use it.
The most frustrating aspect of using a singleton for me came when I tried to run some code in parallel with joblib. Inside the parallel processes, the singleton always acted like it hadn’t been instantiated yet. My parallel code only worked when I added another instantiation of the singleton inside the function called by the process. It took me a long time to figure out why.
Why Singletons Fail with Multiprocessing
The best explanation I have found for why singletons throw errors with multiprocessing in Python is this answer from Stack Overflow:
Each of your child processes runs its own instance of the Python interpreter, hence the singleton in one process doesn’t share its state with those in another process.
Your singleton instance won’t be shared across processes.
Working Around Singleton Errors in Multiprocessing
There are several ways to work around this problem. Let’s start with a basic singleton class and see how a simple parallel process will fail.
In this example, the singleton needs to do an empty instantiation inside your worker function because we want access to some attribute stored in the singleton. We don’t know what value to instantiate it with because that’s the very thing we’re trying to access from the attribute.
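A minimal sketch of that failure, under a few assumptions: `Settings` is a hypothetical singleton that insists on a value the first time it is created, and the standard library's `concurrent.futures` stands in for joblib. The `spawn` start method is chosen on purpose so that every worker begins in a fresh interpreter, which is roughly the situation described above.

```python
import multiprocessing
from concurrent.futures import ProcessPoolExecutor


class Settings:
    """Hypothetical singleton: the first call must supply a value."""

    _instance = None

    def __new__(cls, value=None):
        if cls._instance is None:
            if value is None:
                raise RuntimeError("Settings has not been instantiated yet")
            cls._instance = super().__new__(cls)
            cls._instance.value = value
        return cls._instance


def worker(x):
    # "Empty" instantiation: we only want the attribute stored earlier.
    try:
        return Settings().value * x
    except RuntimeError as err:
        return f"failed: {err}"


if __name__ == "__main__":
    Settings(10)      # instantiate once in the parent process
    print(worker(2))  # works in the parent: 20

    # Each spawned worker imports this module from scratch, so its own
    # Settings._instance starts out as None.
    ctx = multiprocessing.get_context("spawn")
    with ProcessPoolExecutor(max_workers=2, mp_context=ctx) as pool:
        print(list(pool.map(worker, [1, 2])))
```

Run as a script, the second `print` should show the "failed" message for each task, because the instance created in the parent never reaches the workers.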
Here’s a simple solution I came up with that worked for me, and might for you as well. The solution here uses environment variables to store state across processes.
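One way the environment-variable idea might look is sketched below. The class, the variable name `SETTINGS_VALUE`, and the use of `concurrent.futures` instead of joblib are all assumptions made for illustration. The sketch relies on child processes inheriting the parent's environment, and on the value being representable as a string, since environment variables cannot hold anything else.

```python
import multiprocessing
import os
from concurrent.futures import ProcessPoolExecutor

ENV_KEY = "SETTINGS_VALUE"  # hypothetical variable name


class Settings:
    """Hypothetical singleton that mirrors its value into the environment."""

    _instance = None

    def __new__(cls, value=None):
        if cls._instance is None:
            if value is None:
                # Fresh process: fall back to what the parent exported.
                value = os.environ.get(ENV_KEY)
                if value is None:
                    raise RuntimeError("Settings has not been instantiated yet")
            else:
                # First real instantiation: export the value for child processes.
                os.environ[ENV_KEY] = str(value)
            cls._instance = super().__new__(cls)
            cls._instance.value = str(value)
        return cls._instance


def worker(x):
    # The empty instantiation now recovers the value from the environment.
    return f"{Settings().value}-{x}"


if __name__ == "__main__":
    Settings("prod")
    ctx = multiprocessing.get_context("spawn")
    with ProcessPoolExecutor(max_workers=2, mp_context=ctx) as pool:
        print(list(pool.map(worker, [1, 2])))
```

The trade-off is that only simple, string-like state survives the trip. Anything richer would need to be serialized first.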
Pass Singleton as Argument
Another solution is to simply pass the instantiated singleton instance as an argument to the worker function.
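A sketch of this approach, again with a hypothetical `Settings` class and `concurrent.futures` in place of joblib. One wrinkle worth noting: arguments are pickled on their way to the worker, and by default pickle rebuilds an object by calling `__new__` with no arguments, which a strict singleton like this one would reject. The `__getnewargs__` method here is one way around that, chosen for illustration.

```python
import multiprocessing
from concurrent.futures import ProcessPoolExecutor
from functools import partial


class Settings:
    """Hypothetical singleton: the first call must supply a value."""

    _instance = None

    def __new__(cls, value=None):
        if cls._instance is None:
            if value is None:
                raise RuntimeError("Settings has not been instantiated yet")
            cls._instance = super().__new__(cls)
            cls._instance.value = value
        return cls._instance

    def __getnewargs__(self):
        # Tells pickle to rebuild the object in the worker as Settings(value).
        return (self.value,)


def worker(settings, x):
    # No empty instantiation needed: the instance arrives as an argument.
    return settings.value * x


if __name__ == "__main__":
    settings = Settings(10)
    ctx = multiprocessing.get_context("spawn")
    with ProcessPoolExecutor(max_workers=2, mp_context=ctx) as pool:
        print(list(pool.map(partial(worker, settings), [1, 2, 3])))
```

A side effect of rebuilding the object this way is that the worker ends up with its own instantiated singleton, so later empty instantiations inside that worker should succeed too.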
Python has a robust ecosystem of data science packages. In this article, I’ll discuss the most popular Python packages for data science, including the essentials as well as my favorite packages for visualization, natural language processing, and deep learning.
Essential Python Packages for Data Science
The vast majority of data science workflows utilize these four essential Python packages.
I recommend using Anaconda to set up your Python environment. One of its many benefits is that it automatically installs these four libraries, along with many other essential Python packages for data science.
The fundamental package for scientific computing with Python.
The Pandas dataframe is the primary data object for most data science workflows. A dataframe is essentially a database table: named columns, each of which can have its own data type. NumPy ndarrays, in contrast, must have the same data type for every element.
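A small illustration of that difference, using made-up data. The column names and values are purely hypothetical.

```python
import numpy as np
import pandas as pd

# Each dataframe column keeps its own data type.
df = pd.DataFrame({
    "name": ["Ada", "Grace", "Alan"],
    "age": [36, 45, 41],
    "score": [9.5, 8.0, 7.25],
})
print(df.dtypes)

# An ndarray has a single dtype, so the integer is upcast to a float.
arr = np.array([36, 9.5])
print(arr.dtype)  # float64

# Converting the mixed dataframe falls back to the generic object dtype.
print(df.to_numpy().dtype)  # object
```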
Simple and efficient tools for predictive data analysis
Scikit Learn is the workhorse for machine learning pipelines. It’s built on top of NumPy, SciPy, and Matplotlib, and plays nicely with Pandas. Scikit Learn offers implementations of almost every popular machine learning algorithm, including logistic regression, random forests, support vector machines, k-means, and many more.
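As a rough sketch of what a pipeline can look like, the example below fits a logistic regression on the bundled iris dataset, loaded as Pandas objects. The dataset, the scaling step, and the split parameters are my choices for illustration, not a recommended recipe.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# as_frame=True returns a Pandas DataFrame (features) and Series (labels).
X, y = load_iris(return_X_y=True, as_frame=True)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# Chain preprocessing and the model so they are fitted together.
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
model.fit(X_train, y_train)

accuracy = model.score(X_test, y_test)
print(f"Test accuracy: {accuracy:.2f}")
```

Swapping in a different algorithm is usually a one-line change to the last step of the pipeline.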
Matplotlib is a comprehensive library for creating static, animated, and interactive visualizations in Python.
Matplotlib is the foundational data visualization package for Python. Pandas and Scikit Learn both have handy visualization modules that rely on Matplotlib. It’s a very flexible and intuitive plotting library. The downside is that creating complex and aesthetically pleasing plots usually requires many lines of code.
My two favorite Python visualization packages for data science are both built on top of Matplotlib.
Seaborn is a Python data visualization library based on matplotlib. It provides a high-level interface for drawing attractive and informative statistical graphics.
Scikit Plot provides one-liners for many common machine learning plots, including confusion matrix heatmaps, ROC curves, precision-recall curves, lift curves, cumulative gains charts, and others. Here’s a slideshow of examples.
Natural Language Processing Packages
Natural language processing (NLP) is my specialty within data science. There’s a lot you can accomplish with Scikit Learn for NLP, which I’ll quickly mention below, but there are two additional libraries that can really help you level-up your NLP project.
spaCy is really powerful, and in my opinion supersedes the NLTK package that used to be the gold standard for things like part-of-speech tagging, dependency parsing, and named entity recognition.
spaCy does all of those for you in one line of code without any NLP knowledge. It’s also extensible, so you can add your own entities and metadata to spans of tokens within each document.
And the displaCy visualization tool is just awesome. Check out this slideshow of examples.
🤗 Transformers provides general-purpose architectures (BERT, GPT-2, RoBERTa, XLM, DistilBert, XLNet…) for Natural Language Understanding (NLU) and Natural Language Generation (NLG) with over 32+ pretrained models in 100+ languages and deep interoperability between TensorFlow 2.0 and PyTorch.
State-of-the-art language models such as BERT and GPT-3 are built on a neural network architecture called the “transformer.” The Transformers library by Hugging Face allows you to apply these pre-trained models to your text data.
By doing so, you can generate vectors for each sentence in your corpus and use those vectors as features in your models. Or if your task is one that Transformers supports, you can just apply a complete model and be done. They currently support the following tasks.
Extractive question answering
Named entity recognition
Deep Learning Packages
Deep learning in Python is dominated by two packages: TensorFlow and PyTorch. There are others, but most data scientists use one of these two, and the split seems roughly equal. So if you want to train a neural network, I recommend picking TensorFlow or PyTorch.
An end-to-end open source machine learning platform
I use TensorFlow for all of my deep learning needs. The Keras high-level wrapper, which is now incorporated into TensorFlow 2.0, was what sold me on TensorFlow. It’s intuitive, reasonably flexible, and efficient.
TensorFlow is production-ready thanks to the TensorFlow Serving module. I’ve used it in combination with AWS SageMaker to deploy a neural network behind an API, but I wouldn’t describe TensorFlow Serving as particularly easy to use.
An open source machine learning framework that accelerates the path from research prototyping to production deployment
I’ve never personally used PyTorch (except as a backend for the Transformers library), so I probably won’t do it justice here. I’ve heard it described as more popular in academia. That quote above from their website suggests they are trying to change that image.
But my understanding is that PyTorch offers a bit more flexibility for designing novel neural network architectures than does TensorFlow. So if you plan to do research on new architectures, PyTorch might be right for you.