Pandas is an open-source Python library for cleaning, transforming, manipulating, and analyzing data. Its power and ease of use have made it one of the most popular tools among data scientists. Despite its irreplaceable functionality, Pandas struggles to meet data scientists’ needs when computations are heavy or the data volume exceeds what a single core can handle. In other words, Pandas is not built for speed and scalability. By default, it uses a single CPU core to execute its functions, which is sufficient for smaller datasets or less complicated calculations.
However, using only a single core for computation on a dataset with millions or even billions of rows can quickly become a disaster. The good news is that Bodo has a solution for data scientists when Pandas does not meet their expectations. We believe data scientists should invest their time in solving problems rather than in becoming distributed systems experts.
Bodo is an innovative Pythonic solution designed to accelerate Python analytics code (e.g., Pandas, NumPy, etc.) by automatically optimizing and parallelizing the computation across CPU cores. Bodo uses highly sophisticated supercomputing techniques under the hood to accelerate the code; however, running code through Bodo does not require any supercomputing knowledge. That means Bodo can boost the existing power of Pandas with only minimal code changes. When the data is small enough to fit on a single node, Pandas functions are entirely sufficient for data manipulation. As the volume of data grows or the computations become more intense, Bodo’s advantages become more evident. See Bodo in action.
Bodo helps data scientists work with big data and offers exceptional acceleration for computation-intensive jobs. Unlike other solutions, Bodo has a gentle learning curve if you are already comfortable with Python and its data science libraries. In most cases, the only thing that needs to be added to the original code is a Bodo jit decorator. If you choose to run the code in a Jupyter notebook, as we do in this blog post, you also need to add the parallel magic command %%px.
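As a rough illustration of how small that change is, the sketch below decorates an ordinary Pandas function with bodo.jit. The function name, the toy data frame, and its column names are made up for this example and are not taken from the notebook. The try/except fallback is also an addition for illustration only: it lets the snippet run as plain Pandas on a machine where Bodo is not installed.

```python
import pandas as pd

# Fall back to a no-op decorator when Bodo is not installed, so the
# sketch still runs (as plain Pandas). This fallback is illustrative only.
try:
    import bodo
    jit = bodo.jit
except ImportError:
    def jit(func):
        return func


# In a Jupyter notebook backed by a parallel cluster, the cell holding
# this function would additionally start with the %%px magic command.
@jit
def mean_fare_by_passengers(df):
    # Ordinary Pandas code; the decorator is the only addition.
    return df.groupby("passenger_count")["fare_amount"].mean()


# Tiny hand-made frame standing in for the taxi data (hypothetical values).
trips = pd.DataFrame(
    {
        "passenger_count": [1, 1, 2, 2, 2],
        "fare_amount": [10.0, 12.0, 7.5, 8.5, 11.0],
    }
)

mean_fares = mean_fare_by_passengers(trips)
print(mean_fares)
```

The body of the function is unchanged Pandas code, which is the point of the decorator approach: removing the decorator gives back the original single-core version.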
Here, we show in a few examples how you can boost Pandas performance using Bodo. The data we are working with is the 2019 New York City (NYC) yellow taxi data, sourced from the NYC Taxi and Limousine Commission website. The yellow taxi datasets are originally in CSV format and divided by month. In this GitHub repository, we shared the notebook that shows how to concatenate the dataset pieces into one large data frame. This notebook also contains all the code shared in this blog post, including the visualization part. In the examples below, we run a few computation-heavy functions with Pandas and with Bodo. We time the computations and compare them visually at the end. You can copy the code and run it in your local environment if you already have Bodo installed.
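For readers who want a feel for the concatenation step before opening the notebook, here is a minimal sketch of one way to combine monthly CSV files with plain Pandas. It is not the notebook's code: the helper name, the file-name pattern, and the tiny demo files written to a temporary directory are all assumptions made for this example.

```python
import glob
import os
import tempfile

import pandas as pd


def concat_monthly_csvs(directory, pattern="yellow_tripdata_2019-*.csv"):
    """Read every CSV matching `pattern` in `directory` into one data frame.

    The default pattern is an assumed naming scheme, not one confirmed
    by the blog post; adjust it to match your downloaded files.
    """
    paths = sorted(glob.glob(os.path.join(directory, pattern)))
    if not paths:
        raise FileNotFoundError(f"no files matching {pattern!r} in {directory!r}")
    frames = [pd.read_csv(path) for path in paths]
    # ignore_index gives the combined frame one continuous row index.
    return pd.concat(frames, ignore_index=True)


# Demo with two tiny stand-in files (hypothetical data, two rows each).
with tempfile.TemporaryDirectory() as tmp:
    for month, fare in [("01", 10.0), ("02", 12.5)]:
        monthly = pd.DataFrame({"fare_amount": [fare, fare + 1.0]})
        monthly.to_csv(
            os.path.join(tmp, f"yellow_tripdata_2019-{month}.csv"), index=False
        )
    combined = concat_monthly_csvs(tmp)

print(len(combined))
```

Sorting the paths keeps the months in calendar order, and reading everything into a list before a single pd.concat call avoids repeatedly copying a growing data frame.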
Pandas + Bodo
Using the same dataset and code, we only added a bodo.jit() decorator, plus the parallel magic command %%px when running the code in a Jupyter notebook environment. Pandas takes about 5x longer than Bodo Community Edition to run the overall task pipeline. Bodo saved us about 25 minutes.
Total runtime for Pandas is 1855 seconds, and total runtime for Bodo Community Edition is 375 seconds. That is a ~5x performance improvement overall.
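The arithmetic behind these figures is simple to check. The short snippet below uses only the two totals reported above; the variable names are ours.

```python
# Totals reported in this post, in seconds.
pandas_seconds = 1855
bodo_seconds = 375

speedup = pandas_seconds / bodo_seconds
saved_minutes = (pandas_seconds - bodo_seconds) / 60

print(f"speedup: {speedup:.2f}x")                   # speedup: 4.95x
print(f"time saved: {saved_minutes:.1f} minutes")   # time saved: 24.7 minutes
```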
The runtimes shared in this blog post are each the average of 5 runs. Your runtimes may differ depending on your machine's hardware. We ran our code on a 2020 MacBook Pro with an Intel Core i5 processor and 16 GB of memory.
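If you want to reproduce this kind of measurement, a simple averaging harness along the following lines would do. This is a sketch using only the Python standard library, not the timing code from our notebook; the helper name and the stand-in workload are hypothetical.

```python
import statistics
import time


def average_runtime(func, runs=5):
    """Call `func` `runs` times and return the mean wall-clock time in seconds."""
    durations = []
    for _ in range(runs):
        start = time.perf_counter()
        func()
        durations.append(time.perf_counter() - start)
    return statistics.mean(durations)


def sample_workload():
    # Stand-in for a Pandas or Bodo function under test.
    return sum(i * i for i in range(100_000))


avg_seconds = average_runtime(sample_workload)
print(f"average over 5 runs: {avg_seconds:.4f} s")
```

When timing a function decorated with bodo.jit, keep in mind that the first call of a JIT-compiled function typically includes compilation time, so you may want to decide up front whether that first run belongs in the average.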
Thank you for reading. I hope this blog post is useful to you, and I encourage you to see Bodo in action.