Categories
Second Opinion

Maximizing Efficiency and Cost Control in PySpark

PySpark, the Python API for Apache Spark, has revolutionized big data analytics with its speed, scalability, and efficiency. It offers a wide array of data transformation, analysis, and machine learning capabilities, making it a go-to tool for handling large datasets and real-time data streaming. However, as a data scientist, it is crucial to balance […]

Categories
DAGs Data Engineering

How to Improve a DAG

This post is part of a collaboration between Alisa Aylward of Alisa in Techland and Jared Rand of Skillenai. View Jared’s post on Alisa in Techland here. Discover the worst data pipeline ever and how to improve its DAG. Learn to remove cycles and handle dependencies efficiently. What is a DAG? DAG stands for directed […]

Categories
Coding Data Science Python

Singleton Fails with Multiprocessing in Python

A singleton is a class designed to permit only a single instance. Singletons have a bad reputation, but do have (limited) valid uses. They present lots of headaches and may throw errors when used with multiprocessing in Python. This article will explain why, and what you can do to work around it. A Singleton in […]

Categories
Coding Data Science Python

Popular Python Packages for Data Science

Discover the most popular Python packages for data science, including essentials like NumPy and pandas along with hidden gems like seaborn and spaCy.