Bodo ♥️ PyIceberg: Scalable IO with the Simplicity of Python

Date: February 26, 2025
Author: Srinivas Lade

Introduction

Apache Iceberg is an open table format that provides simple, reliable, and scalable tables, interoperable across a wide range of data tools and compute engines. At Bodo, we believe that Iceberg is the future of data warehouse and data lake solutions due to its flexibility and open-source backing. However, using Iceberg at a large scale has often required Java-based tools that are difficult for developers to use and maintain.

This is where Bodo and PyIceberg come in: Bodo makes Iceberg simple to use from Python, with PyIceberg under the hood and no Java components. Developers get a pure-Python environment along with the high performance and scalability of the Bodo engine.

With this approach, there’s no need for Spark clusters, complex JVM-based connectors, or additional infrastructure setup—just fast, efficient Iceberg table operations directly in Python.

What is Bodo?

Bodo is an open-source, high-performance compute engine for Python data processing. Using an innovative auto-parallelizing just-in-time (JIT) compiler, Bodo simplifies scaling Python workloads from laptop to cluster without major code changes. Under the hood, Bodo relies on MPI-based high-performance computing (HPC) technology—making it both easier to use and often orders of magnitude faster than tools like Spark or Dask.
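
As a quick illustration, here is a minimal sketch of a JIT-compiled Pandas function (the DataFrame and column names are made up for this example):

import bodo
import pandas as pd

@bodo.jit
def daily_mean(df):
    # Bodo compiles this function and runs the groupby in parallel
    return df.groupby("day")["sales"].mean()

df = pd.DataFrame({"day": [1, 1, 2, 2], "sales": [10.0, 20.0, 30.0, 40.0]})
print(daily_mean(df))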

What is PyIceberg?

PyIceberg is an open-source Python library for working with Apache Iceberg tables. It provides helpful functions for managing Iceberg catalogs, namespaces, and tables, including creating and dropping tables. In addition, it provides simple methods for reading from and writing to Iceberg tables with data libraries and compute engines such as Pandas, DuckDB, Polars, Ray, and more.
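
As a rough sketch of what the PyIceberg API looks like (the catalog name, its configuration, and the table identifier below are placeholders for illustration), loading a table and reading it into Pandas takes only a few lines:

from pyiceberg.catalog import load_catalog

# Load a catalog by name; connection details come from configuration
# such as ~/.pyiceberg.yaml ("glue" and the table identifier are
# placeholders for this example)
catalog = load_catalog("glue")
table = catalog.load_table("test_ns.source_table")

# Scan the table and materialize it as a Pandas DataFrame
df = table.scan().to_pandas()
print(df.head())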

Why PyIceberg

When we started working on Iceberg support in 2022, there was only one officially supported SDK available: the Iceberg-Java library. This library was used by engines like Spark and Trino, which worked well since they were also primarily written in Java. PyIceberg was also available, but with a very limited set of features and optimizations.

Since Bodo is a Python package that uses C/C++ extensions under the hood, we would have preferred a Python or C++ Iceberg SDK, but the Java library's broader feature coverage made it seem like the better option. So we built a separate Python package, bodo-iceberg-connector, that used Py4J to communicate between Python and Java. While this allowed rapid development, it introduced challenges:

  • Complex setup: It was difficult to install and use locally
  • Poor integration: The overall user experience, from logging to error messages, was hard to integrate with the rest of our engine
  • Performance overhead: Going through Java and Py4J added noticeable overhead, especially for smaller workloads

With all of the amazing development done by the PyIceberg team, coupled with our recent transition to open source, we decided it was time to replace our Iceberg connector with an equivalent PyIceberg backend. This lets developers work with Iceberg tables of any scale in regular Python, with no Java-based components to install or maintain.

Getting Started

To get started, install Bodo and PyIceberg together:

pip install bodo 'pyiceberg[glue]'

From there, you can write code that uses Iceberg IO with Bodo. For the following example, you need the AWS CLI configured locally so that Bodo and PyIceberg can find credentials.

import shutil
import bodo
import pandas as pd

# ---------------- Setup with PyIceberg ---------------- #
df = pd.DataFrame({
    "a": [1, 2, 3, 4, 5],
    "b": ["Bill", "Bob", "Alice", "Sam", "Sarah"],
})

@bodo.jit
def write_python(df):
    df.to_sql("source_table", "iceberg:///tmp/wh", "test_ns")
write_python(df)

# ------------------- IO with Python ------------------- #
@bodo.jit
def read_python():
    return pd.read_sql_table("source_table", "iceberg:///tmp/wh", "test_ns")
print("From Python\n", read_python())

# ------------------- SQL Equivalent ------------------- #
import bodosql
bc = bodosql.BodoSQLContext(catalog=bodosql.FileSystemCatalog("/tmp/wh"))
print("From SQL\n", bc.sql('SELECT * FROM "test_ns"."source_table"'))

# ----------------------- Cleanup ----------------------- #
# Optional, remove if you want to check the tables!
shutil.rmtree("/tmp/wh")

Note that a few features, such as Puffin file support, currently still require the Bodo Iceberg connector to be installed. We are working to integrate those features into Bodo and PyIceberg directly, so stay tuned for more updates!

Conclusion

Now more than ever, Bodo is focused on its goal of democratizing high-performance data engineering. By integrating PyIceberg, we make it easier than ever for Python developers to leverage Apache Iceberg without Java overhead.

Unlike traditional Spark-based approaches that require a complex distributed setup, Bodo lets you run high-performance Iceberg queries directly in Python, just like working with Pandas. That makes it simple to scale from local development to full-cluster execution without switching tools or rewriting code.

We’re looking forward to the next generation of Iceberg support. If you are as well, join our Slack community for updates, and share your feedback on GitHub!
