Performance Benchmarks

Compare Bodo's price and performance with other solutions using our latest well-defined, realistic, and reproducible benchmarks.

NYC Taxi Monthly Trips with Precipitation

For this benchmark, we adapt an example data science workload into a pandas workload that reads from a public S3 bucket and calculates the average trip duration and number of trips, grouped by features such as weather conditions, pickup and dropoff location, month, and whether the trip was on a weekday.

The New York City Taxi and Limousine Commission's For Hire Vehicle High Volume dataset (FHVHV) consists of over one billion trips. To get the weather on a given day, we use a separate dataset of Central Park weather observations. The FHVHV dataset consists of 1,036,465,968 rows and 24 columns (24.7 GiB). The Central Park Weather dataset consists of 5,538 rows and 9 columns (514 KiB).
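The shape of this workload can be illustrated with a minimal pandas sketch on a few synthetic rows. This is not the benchmark code itself: the column names, the 0.1 precipitation threshold, and the helper name `monthly_trips_with_precipitation` are assumptions made for illustration, and the real workload reads Parquet data from S3 rather than building frames in memory.

```python
import pandas as pd

# Tiny synthetic stand-ins for the FHVHV trips and Central Park weather data.
# Column names are assumptions for illustration only.
trips = pd.DataFrame({
    "pickup_datetime": pd.to_datetime([
        "2022-01-03 08:00", "2022-01-03 09:00",  # a Monday
        "2022-01-08 10:00", "2022-01-08 11:00",  # a Saturday
    ]),
    "PULocationID": [1, 1, 2, 2],
    "DOLocationID": [5, 5, 6, 6],
    "trip_time": [600, 900, 1200, 1800],  # seconds
})
weather = pd.DataFrame({
    "date": pd.to_datetime(["2022-01-03", "2022-01-08"]),
    "precipitation": [0.0, 0.4],
})


def monthly_trips_with_precipitation(trips, weather):
    """Join trips to daily weather, then aggregate per feature group."""
    trips = trips.copy()
    # Derive the join key and the grouping features from the pickup time.
    trips["date"] = trips["pickup_datetime"].dt.normalize()
    trips["month"] = trips["pickup_datetime"].dt.month
    trips["weekday"] = trips["pickup_datetime"].dt.dayofweek < 5

    merged = trips.merge(weather, on="date", how="inner")
    # Assumed threshold for calling a day "wet".
    merged["with_precipitation"] = merged["precipitation"] > 0.1

    keys = ["PULocationID", "DOLocationID", "month", "weekday", "with_precipitation"]
    return merged.groupby(keys, as_index=False).agg(
        trips=("trip_time", "count"),
        avg_trip_time=("trip_time", "mean"),
    )


result = monthly_trips_with_precipitation(trips, weather)
print(result)
```

The join against the small weather table followed by a wide group-by is what makes the workload a reasonable stress test: the trips table dominates both the I/O and the shuffle cost.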

Each benchmark is run on a cluster containing 4 worker instances and 128 physical cores. Dask, Modin, and Spark use 4 r6i.16xlarge instances, each consisting of 32 physical cores and 256 GiB of memory. Dask Cloud Provider also allocates an additional c6i.xlarge instance with 2 cores for the distributed scheduler. Bodo is run on 4 c6i.16xlarge instances, each consisting of 32 physical cores and 64 GiB of memory.

Performance Derived From TPC-H Benchmark

Derived from the standard TPC-H benchmarks, we compared Bodo to Spark and Dask for data processing workloads. We used scale factor 1,000 (~1 TB dataset) on a cluster of 16 c5n.18xlarge AWS instances, which together have 576 physical CPU cores and 3 TB of total memory. Bodo provided a median speedup of 23x over Spark (95%+ infrastructure cost savings) and 148x over Dask.
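The link between the 23x speedup and the "95%+" savings figure can be reproduced with a short calculation. This is a sketch under a stated assumption, not the method used to produce the published number: it assumes both engines run on the same cluster billed purely by time, so cost scales with runtime. The function name `cost_savings` is hypothetical.

```python
def cost_savings(speedup: float) -> float:
    """Fraction of infrastructure cost saved when a job runs `speedup` times
    faster on identical hardware billed by time (an assumption)."""
    if speedup <= 0:
        raise ValueError("speedup must be positive")
    # Runtime, and therefore cost, shrinks to 1/speedup of the baseline.
    return 1.0 - 1.0 / speedup


spark_savings = cost_savings(23)
print(f"{spark_savings:.1%}")  # 95.7%
```

Under that assumption, a 23x speedup leaves roughly 4.3% of the original bill, which is consistent with the 95%+ figure quoted above.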

Data Engineering for ML: Data Derived From TPCx-BB Q26

A Fortune 10 customer implemented a benchmark derived from TPCx-BB Query 26. They compared Bodo's compute performance on a straightforward translation of the SQL query into pandas with the SparkSQL version of this query, which their engineers had configured for optimal performance. Bodo is 16.5x faster than optimized Spark on a 125-node cluster (AWS c5n.18xlarge) with 4,500 CPU cores, input data at scale 40,000, and 52 billion rows (2.5TB data in compressed Parquet format).

Data Engineering: TeraSort

Our customer ran another benchmark for data engineering. In this test, Bodo is 9x faster than optimized Spark on a 125-node cluster (AWS c5n.18xlarge) with 4,500 CPU cores and input data at scale factor 10,000 of TeraSort with 100B rows (4TB in compressed Parquet format).

Retail Product Analytics

Our customer ran another benchmark, analyzing the performance of filtering data with customized user-input filters and joining the resulting group back with the original dataset. Bodo is 11x faster than optimized PySpark on a 16-node cluster (AWS c5n.18xlarge) with 576 CPU cores and 120GB of data in compressed Parquet.
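The filter-then-join-back pattern described above might look like the following pandas sketch. The customer's actual code and schema are not public, so everything here is an assumption: the column names, the filter format (a mapping from column to allowed values), the per-product aggregation, and the helper name `filter_and_join_back`.

```python
import pandas as pd

# Hypothetical sales data; the real schema is not described in the source.
sales = pd.DataFrame({
    "product_id": [1, 1, 2, 2, 3],
    "store": ["A", "B", "A", "B", "A"],
    "units": [10, 5, 0, 7, 3],
})


def filter_and_join_back(df, filters):
    """Apply user-supplied filters, aggregate the matching rows per product,
    and join the aggregate back onto every row of the original data."""
    # Build one boolean mask from all filters (column -> allowed values).
    mask = pd.Series(True, index=df.index)
    for column, allowed in filters.items():
        mask &= df[column].isin(allowed)

    per_product = (
        df[mask]
        .groupby("product_id", as_index=False)
        .agg(filtered_units=("units", "sum"))
    )
    # A left join keeps every original row, including products that no
    # filtered row matched (those get NaN in the new column).
    return df.merge(per_product, on="product_id", how="left")


result = filter_and_join_back(sales, {"store": ["A"]})
print(result)
```

Joining an aggregate back onto its source table is a common cause of large shuffles in distributed engines, which is likely why this pattern was chosen as a benchmark.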

End-to-End Machine Learning

Our customer also benchmarked an end-to-end ML pipeline, including data load, data prep, feature engineering, ML training, and ML prediction. Bodo is 120x faster than PySpark on an r5d.16xlarge AWS node with 32 CPU cores.

Retail Price Image Management

Our customer benchmark for retail price image management using simulations yielded equally impressive results. Bodo on a 4-node cluster (AWS m5.24xlarge) is 85x faster than multi-processing Python on a single node.

TPC-H Benchmark

Using the standard TPC-H benchmarks, we compared Bodo to Spark and Dask for data processing workloads. We used scale factor 1,000 of TPC-H (~1 TB dataset) on a cluster of 16 c5n.18xlarge AWS instances, which together have 576 physical CPU cores and 3 TB of total memory. Bodo provided a median speedup of 23x over Spark (95%+ infrastructure cost savings) and 148x over Dask.

Data Engineering for ML: TPCx-BB Q26

Customer benchmark for data engineering (ETL and feature engineering). Bodo is 16.5x faster than optimized Spark on a 125-node cluster (AWS c5n.18xlarge) with 4,500 CPU cores; input data is scale 40,000 of TPCx-BB with 52 billion rows (2.5TB data in compressed Parquet format).

Data Engineering: TeraSort

Customer benchmark for data engineering. Bodo is 9x faster than optimized Spark on a 125-node cluster (AWS c5n.18xlarge) with 4,500 CPU cores; input data is scale factor 10,000 of TeraSort with 100B rows (4TB in compressed Parquet format).

Retail Product Analytics

Customer benchmark for filtering data using customized user-input filters and joining the resulting group back with the original dataset. Bodo is 11x faster than optimized PySpark on a 16-node cluster (AWS c5n.18xlarge) with 576 CPU cores (input data is 120GB in compressed Parquet).

End-to-End Machine Learning

Customer benchmark for an end-to-end ML pipeline including data load, data prep, feature engineering, ML training, and ML prediction. Bodo is 120x faster than PySpark on an r5d.16xlarge AWS node (32 CPU cores).