Making pandas.DataFrame.apply Faster with Bodo

Niyousha Mohammadshafie

The apply function in Pandas is a game-changer, as it makes it easy to run user-defined functions over all the rows of a DataFrame. The problem with apply is that it can be unbearably slow when working with big data. In today’s deadline-driven world, code efficiency is a prerequisite for the success of any data science project, so it is important to use the fastest available solutions, such as Bodo. Bodo offers our code unmatched performance and scalability, and unlike Spark or similar solutions, it does not require us to make major changes to our code or learn a new skill. Bodo makes up for Pandas’ lack of speed while remaining just as powerful and user-friendly as Pandas.

In this short blog post, we compare the performance of the apply function executed on a Pandas DataFrame in Pandas and in Bodo. For this comparison, we use the New York City parking violation dataset, sourced from the New York City website. NYC parking violation datasets are a notorious example of the kind of big data that data scientists struggle with. We aim to find out at what hour of the day a parking violation is most likely to happen. To achieve this goal, we need to create a new column in our DataFrame, which can be done using pandas.DataFrame.apply with a lambda function. The New York City website publishes NYC parking violation datasets every year; here, we focus on the 2017 dataset as an example. Since this is a relatively large dataset and not all the columns are useful, we only load two columns: violation time and plate ID.


import pandas as pd
import numpy as np
import time
import bodo

nyc_parking_2017 = pd.read_csv(
    'Parking_Violations_Issued_-_Fiscal_Year_2017.csv',
    usecols=['Violation Time', 'Plate ID'])
nyc_parking_2017.head()

[Output: first five rows of the Plate ID and Violation Time columns]

Let’s check the data type of our columns and the number of rows we have using DataFrame.info():
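For readers unfamiliar with info(), here is a minimal sketch of the same inspection on a toy DataFrame with the same two columns (the values are made up for illustration):

```python
import pandas as pd

# Toy frame with the same two object-dtype columns as the parking dataset
df = pd.DataFrame({
    'Plate ID': ['GBB9093', '62416MB'],
    'Violation Time': ['0143A', '0400P'],
})

# Prints the index range, non-null counts, column dtypes, and memory usage
df.info()
```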


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10803028 entries, 0 to 10803027
Data columns (total 2 columns):
#   Column          Dtype
--- ------          ------
0   Plate ID        object
1   Violation Time  object
dtypes: object(2)
memory usage: 164.8+ MB

To extract the hour from the violation time column, we need to run a user-defined function (UDF) using pandas.DataFrame.apply. Running a UDF in Pandas takes a long time as apply generally does not take advantage of Pandas’ vectorization. Let’s take a look at the execution time of the below cell:


start = time.time()
nyc_parking_2017['hour'] = nyc_parking_2017['Violation Time'].apply(
    lambda time: time[:2] + ':' + time[2:] if len(str(time)) == 5 else None)
nyc_parking_2017['hour'] = pd.to_datetime(
    nyc_parking_2017['hour'], errors='coerce').dt.hour
end = time.time()
print("Reading Time: ", end - start)
print(nyc_parking_2017['hour'].head())

Reading Time: 451.5508031845093
0 1.0
1 16.0    
2 14.0    
3 11.0    
4 17.0    
Name: hour, dtype: float64    
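As an aside that is not part of the original benchmark, the same hour extraction can also be written with Pandas’ vectorized string methods instead of apply, which is often faster on a single core; a minimal sketch on a few sample values:

```python
import pandas as pd

# Sample 'Violation Time' values in the dataset's HHMM + A/P format
times = pd.Series(['0143A', '0400P', None, '123'])

# Keep only well-formed 5-character times and insert the colon,
# using vectorized .str operations instead of apply
valid = times.str.len().eq(5).fillna(False)
formatted = (times.str[:2] + ':' + times.str[2:]).where(valid)

# Parse and extract the hour; malformed values coerce to NaN
hours = pd.to_datetime(formatted, errors='coerce').dt.hour
print(hours)
```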

As demonstrated in the previous cell, running the UDF on one core in Pandas took 451 seconds (about 7.5 minutes). Unlike Pandas, Bodo compiles the apply code directly, preserving vectorization while parallelizing the execution across cores. This means that with minor changes to the code, we can significantly speed up our UDF. You can run Bodo for free on up to 4 cores. Here is a link to install it yourself: Bodo Installation. Now, let’s check the performance of our code with Bodo on 4 cores:


%%px
@bodo.jit(distributed=['nyc_parking_2017'])
def hour_column():
    start = time.time()
    nyc_parking_2017 = pd.read_csv(
        'Parking_Violations_Issued_-_Fiscal_Year_2017.csv',
        usecols=['Violation Time', 'Plate ID'])
    nyc_parking_2017['hour'] = nyc_parking_2017['Violation Time'].apply(
        lambda time: time[:2] + ':' + time[2:] if len(str(time)) == 5 else None)
    nyc_parking_2017['hour'] = pd.to_datetime(
        nyc_parking_2017['hour'], errors='coerce').dt.hour
    end = time.time()
    print("Reading Time: ", end - start)
    print(nyc_parking_2017['hour'].head())
    return nyc_parking_2017

nyc_parking_2017 = hour_column()

[stdout:0]
Reading Time: 13.471014976501465
0 1
1 16
2 14
3 11
4 17
Name: hour, dtype: Int64

Wow, that was fast! Running the same code with Bodo took only 13 seconds. That’s more than a 30x improvement over Pandas. Now that we have created the ‘hour’ column, we can count the total number of violation tickets per hour and visualize the results using the Matplotlib and Seaborn libraries.


import matplotlib.pyplot as plt
import seaborn as sns

ticket_per_hour = nyc_parking_2017.groupby(['hour'])['Plate ID'].count()
ticket_per_hour = (ticket_per_hour.to_frame().reset_index()
                   .rename(columns={'Plate ID': 'Number of Tickets in Each Hour'}))
plt.figure(figsize=(12,8))
sns.barplot(x = ticket_per_hour['hour'], 
            y = ticket_per_hour['Number of Tickets in Each Hour'], alpha=0.6)
plt.tick_params(labelbottom='on')
plt.ylabel('Number of cars', fontsize=16);
plt.xlabel('Hour', fontsize=16);
plt.title('Parking Tickets given at different time of the day', fontsize=16);

[Figure: bar chart of the number of parking tickets given at each hour of the day]

The results show that the total number of parking tickets per hour peaks around noon and bottoms out around 4 am.
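The peak and quietest hours can also be read off programmatically with idxmax and idxmin; a minimal sketch on illustrative counts (the real numbers come from the groupby result above):

```python
import pandas as pd

# Illustrative hourly ticket counts, indexed by hour (stand-in for the real data)
ticket_per_hour = pd.Series(
    [50, 30, 20, 15, 10, 25, 80, 150, 300, 420, 500, 560,
     600, 580, 540, 480, 400, 300, 200, 150, 120, 100, 80, 60],
    index=range(24), name='Number of Tickets in Each Hour',
)

peak_hour = ticket_per_hour.idxmax()    # hour with the most tickets
quiet_hour = ticket_per_hour.idxmin()   # hour with the fewest tickets
print(peak_hour, quiet_hour)  # -> 12 4
```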

Conclusion

Without a doubt, Pandas DataFrames have become a staple for data scientists, but even simple Pandas functions struggle to keep up with the exponential growth of data. As demonstrated with a simple example in this post, exploratory data analysis on large datasets can be painfully slow. Bodo is a modern solution that significantly speeds up the data-wrangling process without major modifications to your Pandas source code and brings scalability to your data science workflow. I hope this blog post is useful to you, and I encourage you to check out Bodo’s website at https://bodo.ai. You can also reach out to me with any questions at niyousha@bodo.ai. Thank you for reading!