Python’s ecosystem is filled with powerful libraries, and the last few years have made it the go-to language for machine learning and AI. However, running custom operations from these libraries in parallel is challenging: most parallel engines make it a hassle to use anything but their built-in operations. This friction often leads to tedious workarounds or suboptimal performance.
At Bodo, our north star is the ability to parallelize Python code and massively speed up data processing with minimal complexity. We recently made it even simpler: Bodo’s new @bodo.wrap_python decorator makes it very easy to run libraries in parallel while Bodo delivers high parallel performance for the whole workflow.
Bodo is an open-source, high-performance compute engine for Python data processing. Using an innovative auto-parallelizing just-in-time (JIT) compiler, Bodo simplifies scaling Python workloads without major code changes. Under the hood, Bodo relies on MPI-based high-performance computing (HPC) technology—making it both easier to use and often orders of magnitude faster than tools like Spark or Dask.
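As a minimal sketch of what that looks like (illustrative file and column names, mirroring the full example later in this post), a pandas workload becomes parallel just by adding the @bodo.jit decorator:

import pandas as pd
import bodo

@bodo.jit
def mean_of_column():
    # Bodo parallelizes the read and the aggregation across MPI ranks automatically
    df = pd.read_parquet("data.parquet")
    return df["A"].mean()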
Bodo’s new @bodo.wrap_python decorator lets you call any regular Python function inside Bodo’s parallel JIT code without extra code changes. You only need to specify the data type of the function’s return value; Bodo takes care of the rest.
This allows you to keep using your favorite Python libraries in parallel while still benefiting from Bodo’s high-performance execution for the rest of your workflow.
Below is a quick sample using Hugging Face Transformers. It applies a tokenizer to each row of a DataFrame column, with the results computed and stored in parallel.
out_type = bodo.typeof(np.array([1, 2]))

@bodo.wrap_python(out_type)
def run_tokenizer(text):
    tokenized = tokenizer(text, ...)
    return np.array(tokenized["input_ids"])

@bodo.jit
def example():
    ...
    input_ids = df["A"].map(run_tokenizer)
The Python function runs as regular Python and can contain any Python code, without JIT compilation restrictions. The output type is specified simply by passing a sample of the function’s output values to bodo.typeof, and it must be a type supported by the Bodo compiler. This allows the compiler to convert the output Python objects into native JIT values and continue JIT execution.
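As a rough sketch (the names below are illustrative, not part of the example above), any return type supported by the compiler can be declared the same way, by passing a representative sample value to bodo.typeof:

import numpy as np
import bodo

# Sample values only fix the type; their contents don't matter
float_arr_type = bodo.typeof(np.array([0.0]))  # 1D float64 array

@bodo.wrap_python(float_arr_type)
def embed(text):
    # any regular Python code can run here, e.g. a model call from another library
    return np.array([float(len(text))])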
See the Bodo docs here for more information; the GitHub repository also has several examples here using this feature. Below is the full example code for reference. I saw a ~5x speedup on my local M1 Mac with a sample 100k-row dataset, and of course this code can now scale to any cluster and data size as well.
import pandas as pd
import numpy as np
import bodo
from transformers import AutoTokenizer
import time

tokenizer = AutoTokenizer.from_pretrained("EleutherAI/gpt-neo-2.7B")
tokenizer.pad_token = tokenizer.eos_token

# Declare the wrapped function's return type from a sample value
out_type = bodo.typeof(np.array([1, 2]))

@bodo.wrap_python(out_type)
def run_tokenizer(text):
    tokenized = tokenizer(text, truncation=True, max_length=512, padding="max_length")
    return np.array(tokenized["input_ids"])

@bodo.jit
def example():
    t0 = time.time()
    df = pd.read_parquet("data.parquet")
    # run_tokenizer is called as regular Python on each element, in parallel
    input_ids = df["A"].map(run_tokenizer)
    pd.DataFrame({"input_ids": input_ids}).to_parquet("output.parquet")
    print("Time:", time.time() - t0)

if __name__ == "__main__":
    example()
When your parallel JIT code reaches the run_tokenizer call, Bodo temporarily “boxes” the arguments into standard Python objects, calls the function as plain Python, and then “unboxes” the output back into native JIT values. The compiler also adds mechanisms that allow caching the generated binary when possible. This is all built on top of the Numba and llvmlite compiler infrastructure.
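For example, assuming bodo.jit accepts a cache flag as Numba’s JIT does (a minimal sketch, not taken from the example above), caching can be requested explicitly:

@bodo.jit(cache=True)  # assumption: cache=True persists the compiled binary across runs
def example():
    df = pd.read_parquet("data.parquet")
    return df["A"].map(run_tokenizer)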
This feature is similar to Numba’s objmode, but it has a simpler function interface and can be significantly faster since it avoids some of objmode’s runtime overheads (wrap_python is 3.5x faster than objmode for the example above). If you’re curious about the implementation details, check out the PR on GitHub.
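For comparison, here is a rough sketch of the same pattern written with Numba’s objmode context manager (illustrative function names, not the benchmark code above); each output variable’s type has to be declared as a string inside the JIT function:

import numpy as np
from numba import njit, objmode

def tokenize_plain(text):
    # stand-in for a real tokenizer call
    return np.array([ord(c) for c in text], dtype=np.int64)

@njit
def tokenize_one(text):
    with objmode(ids="int64[:]"):  # output type declared per variable
        ids = tokenize_plain(text)
    return ids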
This feature is available in Bodo pip and Conda releases starting from 2024.12.3.
To try it out on macOS or Linux, install Bodo with:
pip install bodo
(Windows support is under active development.) For more examples, see the Bodo documentation and our GitHub repository.
Coupled with Bodo’s high-performance JIT compiler and MPI-based backend, you can now combine your favorite Python toolkits with the power of HPC—without complicated rewrites. Give Bodo a try and share your feedback on GitHub or our Slack community!