AWS S3 Tables is a new service that simplifies working with the Apache Iceberg open table format by handling table setup and maintenance automatically. Built-in table maintenance and optimization make managing large datasets much easier. However, the main option for using S3 Tables today is Apache Spark, which requires complex configuration and a Java-based engine. Bodo addresses this challenge by making S3 Tables easy to use from both Python and SQL, enabling reads and writes of large tables with high performance and efficiency.
Bodo is an open-source, high-performance compute engine for Python data processing. Using an innovative auto-parallelizing just-in-time (JIT) compiler, Bodo simplifies scaling Python workloads from laptop to cluster without major code changes. Under the hood, Bodo relies on MPI-based high-performance computing (HPC) technology—making it both easier to use and often orders of magnitude faster than tools like Spark or Dask.
Bodo enables efficient querying and processing of S3 Tables without complex infrastructure or data movement. Unlike traditional Spark-based approaches, it performs table operations directly in Python, using familiar Pandas APIs to read and write Iceberg tables like normal dataframes. This means you can run analytical queries, aggregations, and transformations on S3 Tables without migrating data to another system or manually configuring distributed frameworks.
To switch a script from running on a local dataset to a remote Iceberg table on S3 Tables, you just need the table bucket ARN and table namespace.
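Concretely, those two values are all the connection information a Bodo script needs (a minimal sketch; the ARN and namespace below are placeholders for your own resources):
# Placeholder values for illustration -- replace with your own table bucket ARN and namespace
bucket_arn = "arn:aws:s3tables:us-east-2:123456789012:bucket/my-bucket"
namespace = "my_namespace"
conn_str = f"iceberg+{bucket_arn}"  # Iceberg connection string used by Bodo's Pandas APIs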
To access S3 Tables, make sure you have active AWS credentials (e.g., AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY set, or credentials configured in ~/.aws/config) and that the necessary dependencies are installed and up to date. Also make sure the user associated with your credentials has the AmazonS3TablesFullAccess policy attached.
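A quick way to confirm that credentials are picked up before running any S3 Tables code is to ask AWS which identity they resolve to (a minimal sketch using boto3's STS client):
import boto3

# Raises an error if no valid credentials are configured
identity = boto3.client("sts").get_caller_identity()
print("Using AWS identity:", identity["Arn"])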
You can write a dataframe with Bodo to S3 Tables using the to_sql method of DataFrames.
import pandas as pd
import bodo

@bodo.jit
def write_table(df, namespace, bucket_arn):
    # Write the dataframe as an Iceberg table in the S3 Tables bucket
    df.to_sql(
        name="my_table",
        con=f"iceberg+{bucket_arn}",
        schema=namespace,
        if_exists="replace",
    )
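Using the placeholder bucket_arn and namespace from the sketch above, a call might look like this:
df = pd.DataFrame({"A": [1, 2, 3], "B": [4.0, 5.0, 6.0]})
write_table(df, namespace, bucket_arn)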
You can read an S3 Tables table into a dataframe with Bodo using the Pandas read_sql_table function.
@bodo.jit
def read_table(bucket_arn, namespace, table_name):
    # Read the Iceberg table back into a dataframe
    df = pd.read_sql_table(
        table_name=table_name,
        con=f"iceberg+{bucket_arn}",
        schema=namespace,
    )
    return df
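Reading the table back then only requires the same connection details (continuing the sketch with the placeholder values above):
result = read_table(bucket_arn, namespace, "my_table")
print(result)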
This feature is available in Bodo pip and Conda releases starting from 2025.1. To try it out on macOS or Linux, install Bodo with:
pip install bodo bodosql bodo-iceberg-connector
(Windows support is under active development.) For more examples, see the Bodo documentation and our GitHub repository.
Coupled with Bodo’s high-performance JIT compiler and MPI-based backend, you can now combine your favorite Python toolkits with the power of HPC—without complicated rewrites. Give Bodo a try and share your feedback on GitHub or our Slack community!
Here is a full example of how S3 Tables can be used with Bodo. It first sets up S3 Tables by creating a table bucket and a namespace within it. It then writes a table of random data, reads it back, and computes a simple aggregation, which is written to a new table. Finally, it cleans up the AWS resources it created.
import os
import boto3
import numpy as np
import pandas as pd
import bodo
os.environ["AWS_REGION"] = "us-east-2"
# Create an S3 Tables table bucket and a namespace within it
s3_tables_client = boto3.client(
    "s3tables",
    region_name=os.environ.get("AWS_REGION", "us-east-2"),
)
bucket_arn = s3_tables_client.create_table_bucket(
    name="my-bucket" + str(np.random.randint(0, 9999)),
)["arn"]
namespace = s3_tables_client.create_namespace(
    tableBucketARN=bucket_arn,
    namespace=["my_namespace"],
)["namespace"][0]
conn_str = f"iceberg+{bucket_arn}"
@bodo.jit
def write_table(conn_str, namespace):
    """Create a table in an S3 Tables bucket full of random data"""
    df = pd.DataFrame(
        {
            "A": np.random.randint(0, 100, (10_000,)),
            "B": np.random.randint(0, 100, (10_000,)),
        }
    )
    df.to_sql(
        name="my_table",
        con=conn_str,
        schema=namespace,
        if_exists="replace",
    )
@bodo.jit
def group_table(conn_str, namespace):
    """
    Group a table by column A taking the sum
    and write the result to a new table
    """
    df = pd.read_sql_table(
        table_name="my_table",
        con=conn_str,
        schema=namespace,
    )
    df.groupby("A").sum().to_sql(
        name="my_table_grouped",
        con=conn_str,
        schema=namespace,
        if_exists="replace",
    )
@bodo.jit
def read_group_table(conn_str, namespace):
    """Read my_table_grouped"""
    df = pd.read_sql_table(
        table_name="my_table_grouped",
        con=conn_str,
        schema=namespace,
    )
    return df
write_table(conn_str, namespace)
group_table(conn_str, namespace)
print(read_group_table(conn_str, namespace))
# Clean up the S3 Tables tables, namespace, and table bucket
s3_tables_client.delete_table(
    tableBucketARN=bucket_arn,
    namespace=namespace,
    name="my_table",
)
s3_tables_client.delete_table(
    tableBucketARN=bucket_arn,
    namespace=namespace,
    name="my_table_grouped",
)
s3_tables_client.delete_namespace(
    tableBucketARN=bucket_arn,
    namespace=namespace,
)
s3_tables_client.delete_table_bucket(
    tableBucketARN=bucket_arn,
)