Large Language Models (LLMs) are computationally intensive, and inference speed can quickly become a bottleneck, especially when you're sending multiple queries and waiting for responses. Slow response times don't just lead to delays; they make real-time applications impractical, drive up compute costs, and create scalability challenges. The challenge isn't just making LLMs faster; it's finding a way to optimize performance without complex infrastructure or code rewrites.
Bodo is a high-performance, parallel Python compute engine that accelerates data processing with minimal code changes. It enables developers to achieve high-performance inference in a way that is both scalable and easy to integrate into existing Python workflows.
This post examines how we used Bodo to speed up LLM inference, using the llm_inference_speedtest.ipynb example from our repo, which showcases dramatic inference speedups via Bodo parallelism. We'll walk through the example, its value, and how to use it. We'll also touch on when Bodo doesn't directly provide a speedup, and why understanding bottlenecks is key.
Since we created an entirely new way for Bodo to call regular Python, we can now specify which functions we want to parallelize (I/O, processing, etc.) and which functions we simply want to wrap and call as a Pythonic black box.
Bodo's new @bodo.wrap_python decorator lets you call any regular Python function inside Bodo's parallel JIT code without extra code changes. You only need to specify the data type of the function's return value; Bodo takes care of the rest.
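To make the pattern concrete before we apply it to LLM calls, here is a minimal, hypothetical sketch (shout and process_all are made-up names, not part of the example):

import pandas as pd
import bodo

@bodo.wrap_python(bodo.string_type)  # declare the return type; Bodo handles the rest
def shout(s):
    # Plain Python, called from JIT code as an opaque black box.
    return s.upper() + "!"

@bodo.jit
def process_all(df):
    # Bodo parallelizes the map; each element is handed to the wrapped function.
    df["shouted"] = df["text"].map(shout)
    return df

print(process_all(pd.DataFrame({"text": ["hello", "world", "bodo"]})))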
To speed up LLM inference, especially when we're sending a lot of queries, we can simply parallelize how we send the queries to the model.
In this case, we're using Simon Willison's wonderful llm library, which makes it easy to send messages to different models by switching a single model name.
The code takes in a series of queries (raw_prompts) and sends them to the model one at a time to collect responses. It's something I do quite often, especially when running evaluations, so it would be useful to parallelize it without spending a lot of effort or time.
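For reference, a plain (non-Bodo) version of that pattern looks roughly like the sketch below; query_all_sequential is a made-up name, and the CSV layout mirrors the example later in this post:

import pandas as pd
import llm

model = llm.get_model("gemini-1.5-flash-8b-latest")

def query_all_sequential(file_path):
    # Plain Python: each prompt waits for the previous response before being sent.
    df = pd.read_csv(file_path)
    raw_prompts = df["prompt"].str.strip().str.lower()
    df["response"] = [model.prompt(p).text() for p in raw_prompts]
    return df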
That's where Bodo comes in. Bodo delivers automatic Python parallelization for data-heavy tasks: static compilation, compiler optimizations, and distributed execution yield near-C++ speeds while maintaining Python's ease of use. And when a particular function doesn't benefit from compilation, we can simply wrap it, as we do here.
Integration is simple: add the @bodo.jit decorator. No code rewrites or complex distributed programming needed.
The code:
"""Preprocess and query LLMs:
- Install llm package using pip install llm
- Set keys for the model you'd like to use, using llm keys set [MODEL]
"""
import pandas as pd
import time
from dotenv import load_dotenv
import llm
import bodo
import os
load_dotenv()
MODEL = "gemini-1.5-flash-8b-latest"
model = llm.get_model(MODEL)
@bodo.wrap_python(bodo.string_type)
def query_model(prompt):
"""
Sends a prompt to the AI Suite and returns the response.
"""
response = model.prompt(prompt)
return response.text()
@bodo.jit
def query_model_all(file_path):
"""Clean up prompts and query the model for all prompts in the dataframe."""
t0 = time.time()
df = pd.read_csv(file_path)
cleaned_prompts = df["prompt"].str.strip().str.lower()
df["response"] = cleaned_prompts.map(query_model)
print("Processing time:", time.time() - t0)
return df
if __name__ == "__main__":
# Read all prompts from the file
file_path = "data/prompts.csv"
# Query the model for all prompts at once
out_df = query_model_all(file_path)
# Print the resulting DataFrame with responses
print(out_df)
Choose the model you want to use, point the script at your prompts, and run it.
@bodo.jit compiles and parallelizes the LLM inference, via Gemini in this instance, but the same could be done for any model.
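Swapping providers is mostly a matter of changing the model name that the llm library resolves; the alternative below is illustrative and requires setting its own key:

MODEL = "gemini-1.5-flash-8b-latest"  # the model used in this post
# MODEL = "gpt-4o-mini"               # e.g., an OpenAI model (after `llm keys set openai`)
model = llm.get_model(MODEL)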
The code is relatively simple: it measures latency and tokens per second across iterations to quantify speed, and pandas presents the benchmark data in a clear, summarized form.
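If you want to reproduce the comparison yourself, a rough sketch is below; timed is a made-up helper, and query_all_sequential is the plain-Python version sketched earlier:

import time

def timed(fn, *args):
    # Time a single end-to-end run and return (result, seconds).
    t0 = time.time()
    out = fn(*args)
    return out, time.time() - t0

# Note: the first call to a @bodo.jit function includes compilation time.
_, bodo_secs = timed(query_model_all, "data/prompts.csv")
_, python_secs = timed(query_all_sequential, "data/prompts.csv")
print(f"Bodo: {bodo_secs:.1f}s, plain Python: {python_secs:.1f}s, speedup: {python_secs / bodo_secs:.1f}x")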
Speedup example: in our test we see a significant speed increase with Bodo. Typical results show 15x+ speedups in LLM inference (1.3 seconds for Bodo, including overhead, vs. 20 seconds in regular Python, for 30 separate queries). This was tested locally on a 2021 M1 Mac with 10 cores.
Bodo accelerates Python code where data processing is the bottleneck. While testing inference, we also looked at running programs entirely locally with query_llm_ollama.py, a script that queries LLMs via Ollama, which is designed to run locally, in this case on my M1 MacBook Pro.
Applying @bodo.jit in query_llm_ollama.py likely won't yield a significant speedup. Why? Because the bottleneck here is often not the Python client code: Ollama takes up the CPU for each LLM call, so parallelizing the calls doesn't help. (Bodo does avoid the lazy-evaluation difficulties of Spark and Dask and is a lot easier to profile, which makes spotting this kind of bottleneck simpler.)
Bodo excels at optimizing compute-heavy Python. If the bottleneck lies elsewhere—in external systems or network delays—Bodo’s direct impact on overall workflow time will be reduced. Recognizing where bottlenecks exist in your application is critical for choosing the right optimization approach.
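One quick way to check where the time is going is to time the model call separately from the surrounding Python; if nearly all of the wall-clock time sits inside model.prompt(), parallelizing the client code won't buy much. A rough, illustrative sketch (profile_one is a made-up helper):

import time
import llm

model = llm.get_model("gemini-1.5-flash-8b-latest")  # or a locally served model

def profile_one(prompt):
    # Compare time spent in client-side Python vs. in the model call itself.
    t0 = time.time()
    cleaned = prompt.strip().lower()
    t1 = time.time()
    text = model.prompt(cleaned).text()
    t2 = time.time()
    print(f"preprocess: {t1 - t0:.6f}s, model call: {t2 - t1:.2f}s")
    return text

profile_one("Explain what a JIT compiler does in one sentence.")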
Experiment by changing models, removing @bodo.jit, and comparing the results!
This example demonstrates Bodo's power for faster LLM inference: automatic parallelism brings Python to near-native speed.
At Bodo, our focus is on accelerating data scientists' Python workflows and removing performance limitations. While showcased with LLMs here, Bodo's benefits extend across data processing, bioinformatics, finance, and more. We aim to broaden Bodo's reach and will share more advancements soon; as always, effective optimization requires understanding your application's performance bottlenecks. Be sure to join our Slack community to connect with other users, share use cases, and stay up to date on the latest from Bodo!