A few weeks ago, we explored how Spark, Dask, and Ray handle scaling on a Pandas-based workload—NYC taxi trip data—on a multi-node cluster with minimal code changes. If you missed it, check out the post here (spoiler: Bodo was 20x-250x faster than Modin on Ray, Dask, and Spark).
That comparison highlighted a key challenge: many engines force users to compromise between performance and ease of use. While tools like Spark and Dask offer distributed execution, they introduce significant overhead, orchestration complexity, and inefficient task scheduling—often requiring users to rethink their approach or modify their code.
This time, we’re adding two more contenders to the mix: Polars and Daft. Unlike the tools in our last comparison, these engines take a different approach using lazy evaluation and their own custom APIs instead of directly scaling Pandas workflows.
Our results reveal significant performance differences: Bodo delivered a 2x speedup over Daft, and a 10x speedup over Polars—all while keeping the simplicity of Pandas intact—meaning no rewrites, no learning curve, and no extra work. Let’s dive into the details.
We will use the same workload and settings from the previous analysis and provide new implementations in both Daft and Polars. To recap, this workload summarizes monthly taxi and rideshare trips from New York City’s Taxi and Limousine Commission (TLC) and precipitation data from Central Park weather observations. The key operations in this workload include reading Parquet and CSV files from S3, joining the weather dataset with the TLC data, applying a user-defined function (UDF) to a column of data (i.e., using Series.map or an equivalent), a group-by aggregation to compute the trip summaries, and finally sorting the output and writing it to Parquet. The total size of the dataset is 24.7 GiB in Parquet format. The Central Park weather data is stored in a single CSV file on S3 and its total size is 514 KiB. This workload was derived from code used by Todd W. Schneider (check out his blog post to learn more).
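To make the shape of the workload concrete, here is a minimal Pandas sketch of the pipeline; the S3 paths, column names, and aggregations are placeholders rather than the exact ones used in the benchmark, and get_time_bucket is the UDF shown later in this post.

import pandas as pd

# Hedged sketch of the workload structure; paths and column names are placeholders.
trips = pd.read_parquet("s3://example-bucket/tlc-trip-data/")           # ~24.7 GiB of Parquet
weather = pd.read_csv("s3://example-bucket/central_park_weather.csv")   # 514 KiB CSV

# Join the weather observations onto the TLC trip records
trips = trips.merge(weather, on="date", how="inner")

# Apply the Python UDF to a column (see get_time_bucket below)
trips["time_bucket"] = trips["pickup_hour"].map(get_time_bucket)

# Group-by aggregation for the monthly summaries, then sort and write Parquet
summary = (
    trips.groupby(["month", "time_bucket"], as_index=False)
    .agg(trip_count=("trip_id", "count"), avg_precipitation=("precipitation", "mean"))
    .sort_values(["month", "time_bucket"])
)
summary.to_parquet("s3://example-bucket/output/monthly_trip_summary.parquet")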
Polars is a DataFrame library written in Rust with its own built-in query optimizer and lazy APIs. It’s designed to deliver performance and scaling benefits over Pandas both locally and on single-node setups in the cloud.
Daft is also a DataFrame library written in Rust with its own built-in query optimizer and lazy APIs. It is designed to be highly scalable and, by integrating with Ray, it can run across multi-node clusters.
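Lazy evaluation is the key behavioral difference from Pandas here: operations build up a query plan, and nothing runs until you explicitly collect the result, which gives the optimizer room to push filters and projections down to the scan. A hedged Polars sketch (the path and column names are placeholders):

import polars as pl

# scan_parquet builds a lazy query plan instead of reading the data eagerly
plan = (
    pl.scan_parquet("s3://example-bucket/tlc-trip-data/*.parquet")
    .filter(pl.col("trip_distance") > 0)
    .select(["pickup_datetime", "trip_distance"])
)
result = plan.collect()  # execution only happens here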
Refer to the previous article to see an overview of the other engines tested on this workload.
We reuse the settings from the original benchmark, which are summarized in the table below:
Notably, since Polars is designed to run on a single node, we only used a single r6i.16xlarge instance. We also measured the performance of Bodo and Daft using the same settings as Polars for a more direct comparison of performance.
We found that Bodo was ~2x faster than Daft for this particular workload and set of APIs. Daft itself performed about 10x faster than the next best implementation from the previous article (PySpark). While Polars could not be evaluated in the full distributed setting, its performance appeared to be significantly slowed by the use of the map_elements API, which executes in a single Python thread, making it a significant bottleneck. Daft, on the other hand, executes the UDF in parallel over each partition, which makes it scale much better on this workload.
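To make that difference concrete, here is a hedged sketch of how the row-wise UDF (the get_time_bucket function shown below) can be attached in each engine; the DataFrames are assumed to already be loaded, and exact signatures may vary between versions.

import daft
import polars as pl

# Polars: map_elements calls the Python function one value at a time,
# in a single Python thread (the bottleneck described above).
monthly_trips_weather = monthly_trips_weather.with_columns(
    pl.col("hour")
    .map_elements(get_time_bucket, return_dtype=pl.Utf8)
    .alias("time_bucket")
)

# Daft: a @daft.udf receives a whole partition's worth of values at once,
# so the Python work is spread across partitions in parallel.
@daft.udf(return_dtype=daft.DataType.string())
def bucket_hours(hours):
    return [get_time_bucket(h) for h in hours.to_pylist()]

df = df.with_column("time_bucket", bucket_hours(df["hour"]))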
Although the UDF in this workload can be rewritten using SQL or SQL-like APIs, which would improve the performance of Daft and Polars, real workloads typically have much more complex Python UDFs that may not be practical to rewrite. For reference, we also measured performance on the same workload using Daft and Polars’ native expressions and found that Bodo was still faster, although the difference was less pronounced, especially when compared with Polars.
The original UDF looks like:
# place rides in bucket determined by hour of the day
def get_time_bucket(t):
    bucket = "other"
    if t in (8, 9, 10):
        bucket = "morning"
    elif t in (11, 12, 13, 14, 15):
        bucket = "midday"
    elif t in (16, 17, 18):
        bucket = "afternoon"
    elif t in (19, 20, 21):
        bucket = "evening"
    return bucket
and can be rewritten in Daft using daft.sql_expr (or if_else, if one wishes to remain in Python):
time_bucket_expr = daft.sql_expr("""
    CASE
        WHEN hour >= 8 AND hour <= 10 THEN 'morning'
        WHEN hour >= 11 AND hour <= 15 THEN 'midday'
        WHEN hour >= 16 AND hour <= 18 THEN 'afternoon'
        WHEN hour >= 19 AND hour <= 21 THEN 'evening'
        ELSE 'other'
    END
""")
Rewriting the UDF as a Daft expression saves roughly 12s of average workload time in the distributed setting, leaving Daft a little less than 1.5x slower than Bodo.
In Polars, you can use when and then expressions to achieve a similar result:
monthly_trips_weather = monthly_trips_weather.with_columns(
    pl.when(pl.col("hour").is_in([8, 9, 10]))
    .then(pl.lit("morning"))
    .when(pl.col("hour").is_in([11, 12, 13, 14, 15]))
    .then(pl.lit("midday"))
    .when(pl.col("hour").is_in([16, 17, 18]))
    .then(pl.lit("afternoon"))
    .when(pl.col("hour").is_in([19, 20, 21]))
    .then(pl.lit("evening"))
    .otherwise(pl.lit("other"))
    .alias("time_bucket")
)
This saves roughly 230s, or about 80% of the workload time, on average. The rewritten code is still roughly 2x slower than Bodo running on a single node.
While these are reasonable approaches for users who want to get the best performance out of their tools, we still believe that showing performance using map-like APIs is important because:
Additionally, more advanced users may leverage Polars’ integration with Numba to accelerate UDFs using the @guvectorize decorator, but this too can come with its own limitations, such as constraints on the types of operations that can be compiled and executed (see our Bodo vs Numba post for more details).
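As an illustration of both the approach and its constraints, here is a hedged sketch of the Numba route; guvectorize kernels cannot easily produce strings, so this version emits integer bucket codes instead, which already hints at the kind of rewrite these constraints force. Column names are placeholders.

import polars as pl
from numba import guvectorize, int64

# Hedged sketch: compile the bucketing logic as a vectorized kernel that
# returns integer codes (0=other, 1=morning, 2=midday, 3=afternoon, 4=evening).
@guvectorize([(int64[:], int64[:])], "(n)->(n)")
def bucket_codes(hours, out):
    for i in range(hours.shape[0]):
        h = hours[i]
        if 8 <= h <= 10:
            out[i] = 1
        elif 11 <= h <= 15:
            out[i] = 2
        elif 16 <= h <= 18:
            out[i] = 3
        elif 19 <= h <= 21:
            out[i] = 4
        else:
            out[i] = 0

# Example usage on a Polars column
df = pl.DataFrame({"hour": [7, 9, 12, 17, 20, 23]})
df = df.with_columns(
    pl.Series("time_bucket_code", bucket_codes(df["hour"].to_numpy()))
)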
To reiterate our points from the last article, Bodo excels at this benchmark because it compiles the entire workload, including the UDF, into efficient machine code, allowing for powerful and fine-grained optimizations. And since the workload is automatically executed in parallel using an MPI backend, it can effectively utilize the large amount of compute available while avoiding some of the overheads of task-based parallelism. This enables:
We also compared the performance of each engine on a smaller example consisting of roughly 20,000,000 rows on a 2024 MacBook Pro. The results are summarized below:
Running on a small dataset locally, Polars and Bodo are fairly comparable; Daft is about 1.75x slower than Bodo in this case, but still much faster than Pandas.
Implementing this workload in Daft required learning a new set of lazy, SQL-like APIs, which calls for a different mental model when programming and debugging compared with Pandas. Daft also uses a different syntax for indexing into DataFrames with expressions, which can be unintuitive for some users. While Daft has good documentation and its APIs are relatively simple, it is still quite a shift from Pandas and other engines like Modin, Dask, or even PySpark, so it is important to take the overhead of learning a new tool into account when deciding whether to switch. On the other hand, Daft’s integration with Ray made it relatively straightforward to go from a notebook running locally to a fully scaled example running on a cluster.
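As a small illustration of that syntax shift, filtering rows by a condition uses a lazy expression in Daft rather than a boolean mask; a hedged sketch with placeholder data:

import daft
import pandas as pd

pdf = pd.DataFrame({"hour": [7, 9, 12]})
ddf = daft.from_pandas(pdf)

# Pandas: boolean-mask indexing
busy_pd = pdf[pdf["hour"] >= 9]

# Daft: filter with an expression, then collect to materialize the result
busy_daft = ddf.where(daft.col("hour") >= 9).collect()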
Similar to Daft, Polars has its own set of lazy, SQL-like APIs, although it was easier to switch between lazy and eager execution for experimentation than in Daft. It’s important to be mindful of the subtle bugs that can arise when migrating to new APIs due to small differences in the parameters and outputs of functions. For example, in Pandas the attribute dayofweek evaluates to 0 for Monday, whereas Polars’ Expression.dt.weekday() uses the convention that Monday is 1. Another issue we ran into with Polars was that it was unable to read the full dataset due to inconsistencies in the schemas of some of the Parquet files. For example, the PULocation column was typed as i32 in some files and i64 in others. To get around this, we used a different version of the dataset that had been read using Bodo and rewritten with a consistent schema (see our discussion about PySpark in the previous article for more details). Another key consideration when working with Polars is that it is intended to be used on a single node, which can limit its effectiveness for large datasets.
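If rewriting the dataset is not an option, one possible workaround (not the approach we took here) is to normalize the offending column while reading each file; a hedged sketch with placeholder paths:

import polars as pl

# Cast the inconsistently typed column to a single type per file before concatenating
files = [
    "s3://example-bucket/tlc-trip-data/2019-01.parquet",
    "s3://example-bucket/tlc-trip-data/2019-02.parquet",
]
frames = [
    pl.read_parquet(f).with_columns(pl.col("PULocation").cast(pl.Int64))
    for f in files
]
trips = pl.concat(frames, how="vertical")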
Ease of use is an important factor when choosing a data processing engine for scaling Python workloads. And scale often comes with complexity. Bodo eliminates this friction by preserving Pandas APIs while applying aggressive compiler optimizations and parallel execution for massive performance gains. The result? Bodo is 2x faster than Daft and 10x faster than Polars—without requiring changes to existing Pandas code or UDFs. Whether running locally or across multiple nodes, Bodo delivers high performance with minimal overhead for developers—making it an ideal choice for easily scaling Pandas workloads.
Ultimately, the choice of engine depends on the specific workloads and user preferences. This benchmark is just one workload using a selection of APIs and we are of course a little biased towards our own engine. We encourage you to test on your own, reproduce our results, and experiment with your own workloads! We would love to hear your feedback on how we can improve our benchmarks.
If you need high scalability and performance without sacrificing familiar Python workflows, Bodo’s unique approach is worth trying out. You can install Bodo using just pip install bodo. Visit our GitHub repository for more information and join the conversation in our community Slack.
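For a feel of what this looks like in practice, here is a minimal sketch of running ordinary Pandas code under Bodo, assuming the bodo.jit decorator as the entry point; the paths, column names, and trimmed UDF are placeholders rather than the benchmark code itself.

import bodo
import pandas as pd

def get_time_bucket(t):
    # trimmed stand-in for the UDF shown earlier
    return "morning" if t in (8, 9, 10) else "other"

# The Pandas code stays as-is; the decorator compiles the whole function,
# UDF included, and runs it in parallel.
@bodo.jit
def summarize(path):
    df = pd.read_parquet(path)
    df["time_bucket"] = df["pickup_hour"].map(get_time_bucket)
    summary = df.groupby("time_bucket", as_index=False).count()
    summary.to_parquet("s3://example-bucket/output/monthly_trip_summary.parquet")

summarize("s3://example-bucket/tlc-trip-data/")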