Python has become the go-to programming language for data engineers and data scientists tackling data-intensive and AI applications. However, scaling Python code efficiently remains a daunting challenge for many developers. Compute engines like Bodo, Spark, Dask, and Ray/Modin aim to bridge this gap, offering Python scaling while striving for high performance.
But with such a wide range of capabilities and trade-offs, how do you choose the right tool for your workload? We at Bodo focus on making it much easier to run Python faster, and we claim significant performance (and cost) advantages. So we decided to put that claim to the test with a benchmark.
We chose an example Python program that computes a summary of monthly trips, joined with precipitation data, on the public NYC Taxi dataset. The example is simple but representative of many Python/Pandas data processing programs. Most other benchmarks are translated from SQL (e.g., TPC-H) and do not represent typical Python workflows. In particular, much of Python's power as a general-purpose language lies in custom procedural code (e.g., in `Series.map` or `DataFrame.apply`), which SQL-oriented benchmarks often lack. We believe the field needs more Pythonic benchmarks like this one.
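For a sense of the code style involved, here is a simplified Pandas sketch of this kind of workload. It is not the benchmark code itself (that is in the linked repository), and the column names and aggregation are illustrative:

```python
import pandas as pd

def monthly_trip_summary(trips_path, weather_path):
    # Read the FHVHV trip records (Parquet) and Central Park weather data (CSV)
    trips = pd.read_parquet(trips_path)
    weather = pd.read_csv(weather_path, parse_dates=["DATE"])

    # Custom procedural logic via Series.map, the kind of code
    # SQL-derived benchmarks rarely exercise
    trips["is_long_trip"] = trips["trip_miles"].map(lambda m: m > 10)

    # Join trips with daily precipitation and summarize by month
    trips["date"] = trips["pickup_datetime"].dt.normalize()
    merged = trips.merge(weather[["DATE", "PRCP"]], left_on="date", right_on="DATE")
    merged["month"] = merged["pickup_datetime"].dt.month
    return merged.groupby("month").agg(
        trips=("is_long_trip", "size"),
        long_trips=("is_long_trip", "sum"),
        avg_precipitation=("PRCP", "mean"),
    )
```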
Our results reveal massive performance differences: Bodo delivered a 20x speedup over Spark (95% cost savings), 50x over Dask (98% cost savings), and a staggering 250x over Ray/Modin (99% cost savings) on this benchmark. This can be attributed to Bodo's HPC-based compiler approach, as opposed to the distributed task-scheduling designs of the other engines.
We'll be the first to admit we're biased: we know Bodo best. However, we approached this benchmark with transparency and fairness, following standard benchmarking practices and openly sharing all benchmark code here (designed to be easily reproducible). Throughout, our focus was on what a regular Python/Pandas user can achieve without wholesale code rewrites or expert tuning. We welcome your feedback, replication, and contributions to improve these findings. Let's explore what these compute engines can do for your Python workloads!
Here is a short description of the engines we tested:
The NYC Taxi dataset is a collection of taxi rides taken in New York City since 2009. It is an interesting real-world public dataset, often used for data science examples and benchmarking, and it includes detailed information such as pickup date and time and trip coordinates. We use the High Volume For-Hire Vehicle (FHVHV) subset, the largest subset, which covers for-hire vehicle trips (Uber, Lyft, etc.) from 2019 to the beginning of 2024. Our example code is derived from code used by Todd W. Schneider in his fascinating blog post.
We use a 4-node cluster on AWS to run the benchmark on all engines (Dask also uses a small separate scheduler instance). The cluster configuration is chosen to make sure all engines have enough memory (Spark in particular), although Bodo can run this benchmark on a fraction of this cluster thanks to efficient memory use. This table summarizes the cluster configuration:
The total size of the dataset is 24.7 GiB in Parquet format. The Central Park Weather data is stored in a single CSV file on S3 and its total size is 514 KiB.
Here are the software versions used for reproducibility:
Below is a graph of the benchmark's execution times for the different systems. We ran the benchmark multiple times and averaged the results; timings were roughly consistent across runs. Bodo is 20x faster than Spark (95% cost savings), 50x faster than Dask (98% cost savings), and 250x faster than Ray/Modin (99% cost savings) for this benchmark.
Bodo and PySpark are the fastest engines for this benchmark; the graph below compares the two more clearly.
We use a smaller subset of the FHVHV dataset to allow the workload to run locally on a laptop, since running code locally first is a typical workflow for many developers. We also included a plain Pandas implementation for local comparison. For this setup, we use a 20 million row subset of the data so that it fits in memory for all engines except PySpark, which has very high memory requirements (Bodo can handle substantially larger datasets locally).
Even at this smaller scale, Bodo shows a roughly 4x improvement over Pandas, while other engines can be substantially slower than regular Pandas.
Running the Pandas code with Bodo was just a matter of adding the `@bodo.jit` decorator to the compute function (although our existing familiarity with Bodo may have played a role). Bodo was able to run on much smaller cluster sizes and had no issues scaling on other configurations.
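In code, the change looks roughly like this (a minimal sketch with an illustrative compute function and a placeholder path, not the actual benchmark code):

```python
import bodo
import pandas as pd

# The only change from the plain Pandas version is the @bodo.jit decorator,
# which compiles the function and parallelizes it across cores/nodes via MPI.
@bodo.jit
def monthly_trip_counts(trips_path):
    trips = pd.read_parquet(trips_path)
    trips["month"] = trips["pickup_datetime"].dt.month  # illustrative column
    return trips.groupby("month").size()

print(monthly_trip_counts("s3://path/to/fhvhv.parquet"))  # placeholder path
```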
Dask's DataFrame library made it relatively easy to translate the Pandas workload: beyond the trivial change of replacing `pd` with `dd`, only a few changes were required. While Dask's DataFrame library closely matches the Pandas API, there was one place where an argument was not supported (`reset_index=False` in `df.groupby`) and another where a default differed slightly (`sort_values` requiring an explicit `ascending=True` argument). These changes were straightforward to make as a Pandas user. The only performance-critical change was specifying the output type of the UDF (sketched below), which would have been difficult to catch had it not been for a helpful user warning. Following Dask's best practices, only the final result was actually computed; the rest of the workload was lazily evaluated, allowing for optimizations to the task graph. Running the full dataset on a smaller cluster initially led to an out-of-memory error, which we resolved by increasing the instance size (from c6i.4xlarge to r6i.8xlarge).
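For illustration, here is a minimal sketch of the UDF output-type change, using Dask's `meta` argument to declare the result type up front; the path, column names, and UDF are placeholders:

```python
import dask.dataframe as dd

trips = dd.read_parquet("s3://path/to/fhvhv.parquet")  # placeholder path

# Declaring the UDF's output name and dtype via `meta` avoids Dask's
# sampling-based type inference (the one performance-critical change we made).
trips["trip_category"] = trips["trip_miles"].map(
    lambda m: "long" if m > 10 else "short",
    meta=("trip_category", "object"),
)

# Everything above is lazy; only this final call executes the task graph.
result = trips.groupby("trip_category").size().compute()
```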
Modin's DataFrame library also made it very easy to port the Pandas workload, since nearly all of its APIs match Pandas exactly. The only change needed was converting the Modin DataFrame to a Ray object before writing to Parquet (sketched below). While getting Modin on Ray to run on a small dataset was straightforward, running the full dataset led to out-of-memory issues on smaller instance sizes. Even with large instances (r6i.16xlarge), Ray spilled objects to disk, which required us to adjust the local disk size of the instances.
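The workaround looked roughly like the following sketch, which assumes Ray Data's `from_modin` converter for bridging Modin DataFrames to Ray Datasets; paths and column names are placeholders:

```python
import modin.pandas as pd
import ray

ray.init()  # Modin uses the Ray cluster for execution

trips = pd.read_parquet("s3://path/to/fhvhv.parquet")  # placeholder path
summary = trips.groupby("hvfhs_license_num").size().reset_index()

# Convert the Modin DataFrame to a Ray Dataset before writing Parquet
ds = ray.data.from_modin(summary)
ds.write_parquet("s3://path/to/output/")  # placeholder path
```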
Using Ray for cluster creation and configuration was for the most part painless, as the yaml file provided in Ray's quickstart documentation made it relatively simple to customize the cluster to fit our use case. The main difficulty with Ray came from the autoscaler, which by default only creates one node on cluster creation. To ensure fairness in our comparison, we used the Ray SDK to request the worker nodes and wait for them to be fully set up before starting the benchmark run.
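Here is a sketch of that wait-for-workers step using the Ray autoscaler SDK's `request_resources`; the target CPU count is a placeholder for the cluster's total vCPUs:

```python
import time
import ray
from ray.autoscaler.sdk import request_resources

ray.init(address="auto")

# Ask the autoscaler to scale to the full worker count up front
# instead of letting it add nodes lazily during the run.
TARGET_CPUS = 128  # placeholder for the benchmark cluster's total vCPUs
request_resources(num_cpus=TARGET_CPUS)

# Block until all worker nodes have actually joined the cluster.
while ray.cluster_resources().get("CPU", 0) < TARGET_CPUS:
    time.sleep(5)
```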
The Spark runs had more difficulties to work through than any other system. We chose the Pandas on Spark API instead of PySpark to avoid complex code rewrites that may not be practical for regular Pandas users. Converting the Pandas workload to Pandas on Spark appeared easy at first, since we could simply replace calls to Pandas with Pandas on Spark. However, several issues had to be resolved. First, access issues required replacing the s3a credential provider with the AnonymousCredentialsProvider. Second, we had to read the data with regular PySpark APIs and drop four datetime columns that Pandas on Spark could not read (both with and without `spark.sql.execution.arrow.pyspark.enabled`). Third, Spark would not unify the datatypes of some of the columns, even with `mergeSchema` enabled, so we resorted to reading and then rewriting the dataset with Bodo so it had a unified schema. In summary, getting the Spark version of the code working (especially with the Pandas on Spark API) took a substantial amount of time and required working through several issues.
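For illustration, the workarounds looked roughly like the following sketch (paths and dropped column names are placeholders; the anonymous credentials provider class is from the Hadoop S3A connector):

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    # Read the public bucket without AWS credentials (Hadoop S3A setting)
    .config(
        "spark.hadoop.fs.s3a.aws.credentials.provider",
        "org.apache.hadoop.fs.s3a.AnonymousAWSCredentialsProvider",
    )
    .getOrCreate()
)

# Read with regular PySpark and drop the datetime columns that
# Pandas on Spark could not handle (column names are illustrative).
sdf = spark.read.parquet("s3a://path/to/fhvhv.parquet").drop(
    "request_datetime", "on_scene_datetime", "pickup_datetime", "dropoff_datetime"
)

# Convert to Pandas on Spark for the Pandas-style part of the workload
psdf = sdf.pandas_api()
counts = psdf.groupby("hvfhs_license_num").size()
```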
Bodo’s advantage in this benchmark is due to its fundamentally different architecture and design:
Bodo’s ability to bridge the gap between ease of use and HPC-grade performance makes it a strong competitor to existing engines like Spark, Dask, and Ray. While Spark remains a dominant force for distributed systems due to its ecosystem, and Dask and Ray excel in flexibility, Bodo offers unparalleled speed, ease-of-use, and cost efficiency for compute-heavy workloads.
Python data processing benchmarks like the one presented here are necessary for comparing compute engine performance. While Spark, Dask, and Ray each have strengths, such as flexibility or large ecosystems, Bodo's HPC-based compilation and MPI parallelism delivered unmatched speed and cost efficiency in this particular test. From a 20x speedup over Spark to a 250x speedup over Ray/Modin, the improvements are striking, especially given the native, Pythonic code style that Bodo supports.
That said, no single benchmark is definitive. We encourage you to explore our open-source benchmark code, replicate our findings, and see how these tools perform in your own environment. If you need high scalability and performance without sacrificing familiar Python workflows, Bodo's unique approach is worth trying out. You can install Bodo with just `pip install bodo`. Visit our GitHub repository for more information and join the conversation in our community Slack.