Bodo and Pandas: How to Accelerate Large Data Analytics on Your Laptop

August 4, 2021

Niyousha Mohammadshafie

The Challenges of Pandas

Pandas is an open-source Python library for cleaning, transforming, manipulating, and analyzing data. Its power and ease of use have made it one of the most popular tools among data scientists. Despite its rich functionality, Pandas struggles when the computations are heavy or the data volume exceeds what a single core can handle. In other words, Pandas is not built for speed and scalability. By default, it uses a single CPU core to execute its functions, which is sufficient for smaller datasets or less complicated calculations.

However, using only a single core for computation on a dataset with millions or even billions of rows can quickly become a bottleneck. The good news is that Bodo has a solution for data scientists when Pandas does not meet their expectations. We believe data scientists should invest their time in solving their problems rather than becoming distributed systems experts.

What is Bodo?

Bodo is a Pythonic solution designed to accelerate Python analytics code (e.g., Pandas and NumPy) by automatically optimizing and parallelizing the computation across CPU cores. Bodo uses sophisticated supercomputing techniques under the hood to accelerate the code, but running code through Bodo requires no supercomputing knowledge. That means Bodo can boost the existing power of Pandas with only minimal code changes. When the data is small enough to fit on a single node, Pandas functions are sufficient for data manipulation. As the volume of data grows or the computations become intensive, Bodo’s advantages become more evident. See Bodo in action.

Examples of Bodo solving Pandas’ challenges

Bodo helps data scientists work with big data and offers substantial acceleration for compute-intensive jobs. Unlike other solutions, Bodo has a gentle learning curve if you are already comfortable with Python and its data science libraries. In most cases, the only thing you need to add to the original code is the bodo.jit decorator. If you run the code in a Jupyter notebook, as we do in this blog post, you also need to add the %%px parallel magic command.
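As a minimal sketch of that pattern, the snippet below decorates an ordinary Pandas function. The function name mean_fare, the fare_amount column, and the three-row sample are made up for illustration. The try/except fallback is also an assumption added here so the sketch runs on a machine without Bodo; in that case the decorator does nothing and the function runs as plain Pandas. The %%px notebook magic is not shown.

```python
import pandas as pd

try:
    import bodo  # separate install; provides the jit decorator
    jit = bodo.jit
except ImportError:
    # Assumption for this sketch only: a no-op stand-in so the code
    # still runs without Bodo. It provides no acceleration.
    def jit(func):
        return func


@jit
def mean_fare(df):
    # Ordinary Pandas code; the decorator is the only addition.
    return df["fare_amount"].mean()


# Tiny made-up sample, not the real taxi data.
sample = pd.DataFrame({"fare_amount": [10.0, 20.0, 30.0]})
result = mean_fare(sample)
print(result)
```

The body of the function is unchanged from what you would write for Pandas alone, which is the point of the decorator approach.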

Here, we show in a few examples how you can boost Pandas performance using Bodo. The data we work with is the 2019 New York City (NYC) yellow taxi data, sourced from the NYC Taxi and Limousine Commission website. The yellow taxi datasets are originally in CSV format and divided by month. In this GitHub repository, we shared the notebook that shows how to concatenate the monthly pieces into one large data frame. This notebook also contains all the code shared in this blog post, including the visualization part. In the examples below, we run a few computation-heavy functions with Pandas and with Bodo. We time the computations and compare them visually at the end. You can copy the code and run it in your local environment if you already have Bodo installed.

Load the data: The code below loads the NYC yellow taxi 2019 dataset into a Pandas DataFrame with and without Bodo. Due to the size of the data, reading the dataset with Pandas takes a long time. Running the same code through Bodo on 4 cores cuts the loading time to less than one-fourth. That is because Bodo optimizes the code and distributes the workload across all cores.

Pandas

Reading time:  228.91828107833862 seconds

Pandas + Bodo

[stdout:0]
Reading time:  49.204203798999515 seconds
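The timed load can be sketched roughly as follows. This is an illustration rather than the notebook's exact code: the file name, the three-row stand-in CSV, and the function name load_data are hypothetical, and the column names are assumed from the public yellow taxi schema. With Bodo, the same function would carry the bodo.jit decorator.

```python
import os
import tempfile
import time

import pandas as pd

# Tiny stand-in file so the sketch is self-contained (not the real data).
csv_text = (
    "tpep_pickup_datetime,passenger_count,payment_type,fare_amount\n"
    "2019-01-15 08:30:00,1,1,12.5\n"
    "2019-06-02 17:45:00,2,2,8.0\n"
    "2019-09-21 23:10:00,1,1,30.0\n"
)
csv_path = os.path.join(tempfile.mkdtemp(), "yellow_taxi_sample.csv")
with open(csv_path, "w") as f:
    f.write(csv_text)


def load_data(path):
    # With Bodo, this function would be decorated with bodo.jit.
    start = time.time()
    df = pd.read_csv(path, parse_dates=["tpep_pickup_datetime"])
    print("Reading time: ", time.time() - start, "seconds")
    return df


taxi = load_data(csv_path)
```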

Group by: The groupby() function splits the data into groups based on some criteria. In this example, we want to compute some statistics about the rides that require groupby(). (i) Do people prefer solo rides to group ones? (ii) What are the common payment types? (iii) How does the fare amount change with the number of passengers?
To answer these questions, we group the data by a column and count or average the rides in each group. As you can see from the outputs, Bodo cuts the computation time of the groupby() operations to about one-fourth.

Pandas

Execution time:  21.10689616203308 seconds

Pandas + Bodo

[stdout:0]
Execution time:  5.038841708999826 seconds
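The three questions map onto three small groupby() calls. The sketch below uses a made-up five-row frame. The column names passenger_count, payment_type, and fare_amount are assumed from the public yellow taxi schema, and the function name ride_statistics is hypothetical.

```python
import time

import pandas as pd

# Made-up sample rows standing in for the taxi data.
taxi = pd.DataFrame({
    "passenger_count": [1, 1, 2, 3, 1],
    "payment_type": [1, 2, 1, 1, 1],
    "fare_amount": [10.0, 20.0, 15.0, 30.0, 6.0],
})


def ride_statistics(df):
    # With Bodo, this function would be decorated with bodo.jit.
    start = time.time()
    # (i) number of rides for each passenger count
    rides_per_party = df.groupby("passenger_count")["fare_amount"].count()
    # (ii) number of rides for each payment type
    rides_per_payment = df.groupby("payment_type")["fare_amount"].count()
    # (iii) average fare for each passenger count
    mean_fare_per_party = df.groupby("passenger_count")["fare_amount"].mean()
    print("Execution time: ", time.time() - start, "seconds")
    return rides_per_party, rides_per_payment, mean_fare_per_party


rides_per_party, rides_per_payment, mean_fare_per_party = ride_statistics(taxi)
```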

Filter data: In data filtering, we select a smaller part of the data frame based on a condition. In this example, we want a dataset that only includes rides after May 2019. Therefore, we filter the data on the pick-up date column and store the result as a separate file in CSV format. Running the same code with Bodo gives roughly a 1.8x speedup (about 532 seconds versus 301 seconds).

Pandas

Execution time:  531.8111879825592 seconds

Pandas + Bodo

[stdout:0]
Execution time:  300.6326244680022 seconds
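A filter-and-save step of this kind might look like the sketch below. It assumes "after May 2019" means pick-ups on or after June 1, 2019. The output file name, the function name filter_and_save, and the four sample rows are hypothetical, and tpep_pickup_datetime is assumed from the public yellow taxi schema.

```python
import os
import tempfile
import time

import pandas as pd

# Made-up sample rows standing in for the taxi data.
taxi = pd.DataFrame({
    "tpep_pickup_datetime": pd.to_datetime([
        "2019-01-15 08:30:00",
        "2019-05-31 23:59:00",
        "2019-06-01 00:00:00",
        "2019-09-21 23:10:00",
    ]),
    "fare_amount": [12.5, 7.0, 8.0, 30.0],
})
out_path = os.path.join(tempfile.mkdtemp(), "rides_after_may.csv")


def filter_and_save(df, path):
    # With Bodo, this function would be decorated with bodo.jit.
    start = time.time()
    cutoff = pd.Timestamp("2019-06-01")  # assumed meaning of "after May"
    after_may = df[df["tpep_pickup_datetime"] >= cutoff]
    after_may.to_csv(path, index=False)
    print("Execution time: ", time.time() - start, "seconds")
    return after_may


after_may = filter_and_save(taxi, out_path)
```

Writing the CSV is part of the timed section, so on the full dataset the measured time includes the cost of writing the output as well as the filter itself.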

User-defined function: In Pandas, we use the apply() method to execute a user-defined function (UDF) on each row or column. While apply() is flexible, it does not take advantage of Pandas' vectorization. In other words, apply() acts like a Python loop that iterates through the rows one by one, which makes it extremely slow on large datasets. Bodo automatically compiles and parallelizes UDFs, making them much faster. Here, we want to know the year, weekday, and hour at which the driver picked up the passenger(s). We can extract this information from the pick-up datetime column using a UDF. As the example below shows, running the code through Bodo is about 50x faster than Pandas.

Pandas

Execution time:  1075.6187331676483 seconds

Pandas + Bodo

[stdout:0]
Execution time:  21.2050147129994 seconds
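A row-wise UDF of the kind described here can be sketched as follows. The function name add_pickup_parts and the two sample rows are made up, and tpep_pickup_datetime is assumed from the public yellow taxi schema. Each lambda runs once per row, which is why this pattern is slow in plain Pandas.

```python
import time

import pandas as pd

# Made-up sample rows standing in for the taxi data.
taxi = pd.DataFrame({
    "tpep_pickup_datetime": pd.to_datetime([
        "2019-01-15 08:30:00",  # a Tuesday
        "2019-06-02 17:45:00",  # a Sunday
    ]),
    "fare_amount": [12.5, 8.0],
})


def add_pickup_parts(df):
    # With Bodo, this function would be decorated with bodo.jit.
    start = time.time()
    df = df.copy()
    # Each lambda is a UDF applied row by row (axis=1).
    df["year"] = df.apply(lambda row: row["tpep_pickup_datetime"].year, axis=1)
    df["weekday"] = df.apply(
        lambda row: row["tpep_pickup_datetime"].dayofweek, axis=1
    )  # Monday is 0, Sunday is 6
    df["hour"] = df.apply(lambda row: row["tpep_pickup_datetime"].hour, axis=1)
    print("Execution time: ", time.time() - start, "seconds")
    return df


with_parts = add_pickup_parts(taxi)
```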

Performance Comparison

Using the same dataset and code, we only added the bodo.jit decorator, plus the %%px parallel magic command when running the code in a Jupyter notebook. Pandas takes about 5x longer than Bodo Community Edition to run the overall pipeline. Bodo saved us about 25 minutes.

Total runtime for Pandas is 1855 seconds, and total runtime for Bodo Community Edition is 375 seconds. That is a ~5x performance improvement overall.
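The per-step and overall speedups can be checked with a few lines of arithmetic on the timings printed in this post, rounded here to two decimals. The dictionary keys are labels chosen for this sketch. The sums of the rounded step timings land within a few seconds of the totals quoted above.

```python
# Timings reported in this post, in seconds, rounded to two decimals.
pandas_seconds = {"load": 228.92, "groupby": 21.11, "filter": 531.81, "udf": 1075.62}
bodo_seconds = {"load": 49.20, "groupby": 5.04, "filter": 300.63, "udf": 21.21}

# Speedup of each step: Pandas time divided by Bodo time.
speedups = {
    step: pandas_seconds[step] / bodo_seconds[step] for step in pandas_seconds
}

total_pandas = sum(pandas_seconds.values())
total_bodo = sum(bodo_seconds.values())
overall_speedup = total_pandas / total_bodo
saved_minutes = (total_pandas - total_bodo) / 60

for step, factor in speedups.items():
    print(f"{step}: {factor:.1f}x")
print(f"overall: {overall_speedup:.1f}x, saved {saved_minutes:.1f} minutes")
```

The UDF step dominates the Pandas total, so its roughly 50x speedup accounts for most of the time saved.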

The runtimes reported in this blog are averages over 5 runs. Runtimes may differ depending on your laptop's hardware. We ran our code on a 2020 MacBook Pro with an Intel Core i5 processor and 16 GB of memory.

Thank you for reading. I hope this blog post is useful to you, and I encourage you to see Bodo in action.