In modern bioinformatics, large-scale genomic data is the norm, requiring efficient ways to read, process, and combine massive collections of sequence files for meaningful analyses. Reading multiple large FASTA files—a routine step in genomics—can become a bottleneck when datasets scale. This post shows how to efficiently parse and combine FASTA sequences in parallel using Bodo, illustrating how quickly we can produce a single CSV of all sequences from many files.
Why is this a problem?
Reading and parsing many large FASTA files with BioPython is inherently serial: a single Python process walks through one record at a time, so runtime grows with both the number and the size of the files while extra CPU cores sit idle. This is where Bodo—a high-performance Python framework—comes in. The example demonstrates how to turn a computationally intensive task into a parallelized, memory-efficient process. Let's break down how it works and why it matters.
The example reads multiple FASTA files in parallel and extracts sequence IDs alongside their peptide (or nucleotide) strings. Specifically:
- It uses BioPython's SeqIO.parse to read every sequence record.
- It stores annotation (the record ID), PROBE_SEQUENCE (the actual sequence), and source (the file's basename) in a DataFrame.
A standard Python implementation uses a for loop to iterate over the files and parse each one in turn. This works for small datasets but becomes impractical at gigabase scale, because everything runs serially in a single Python process.
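For reference, a plain sequential version might look like the following minimal sketch (the function name combine_fastas_serial is illustrative, not from the original post):

import os

import pandas as pd
from Bio import SeqIO


def combine_fastas_serial(fasta_files, out_path):
    # Hypothetical baseline: parse each file one at a time, in one process
    frames = []
    for fasta_file in fasta_files:
        names, seqs = [], []
        for record in SeqIO.parse(fasta_file, "fasta"):
            names.append(record.id)
            seqs.append(str(record.seq))
        df = pd.DataFrame({"annotation": names, "PROBE_SEQUENCE": seqs})
        df["source"] = os.path.basename(fasta_file).split(".")[0]
        frames.append(df)
    pd.concat(frames, ignore_index=True).to_csv(out_path, index=False)

With Bodo, you add a @bodo.jit decorator to the driver function you want to parallelize and a @bodo.wrap_python decorator to the per-file function it calls, and that's it: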
@bodo.wrap_python(df_type)
def fasta4epitope(fasta_file):
    # Parse a single FASTA file, returning ID/sequence pairs
    ...


@bodo.jit(cache=True)
def process_all_fastas(fasta_files, out_path):
    # Loop over each FASTA file in parallel, call `fasta4epitope`,
    # concatenate results, then save to CSV.
    ...
The only differences here are the @bodo decorators; the code changes are extremely minimal.
Bodo's approach:
- @bodo.wrap_python(df_type) wraps a standard Python function so it can be called within a Bodo-jitted context.
- @bodo.jit + bodo.prange splits the list of files across parallel workers.
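As a minimal illustration of this division of labor, here is the same two-decorator pattern on a toy task (the count_chars and total_chars functions below are hypothetical, not part of the example):

import bodo

# The wrapped function's return type must be declared up front so the
# JIT-compiled caller can be typed; bodo.typeof infers it from a sample.
int_type = bodo.typeof(0)


@bodo.wrap_python(int_type)
def count_chars(path):
    # Arbitrary pure-Python work; runs in the regular interpreter
    with open(path) as f:
        return len(f.read())


@bodo.jit
def total_chars(paths):
    total = 0
    # bodo.prange distributes iterations across Bodo's parallel workers;
    # `total` is accumulated as a sum reduction.
    for i in bodo.prange(len(paths)):
        total += count_chars(paths[i])
    return total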
1. Install Bodo via pip
Bodo is pip-installable. On Linux/Mac:
pip install bodo
2. Run the Example
Execute the script like standard Python:
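python process_all_fastas.py

Here is the full script: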
"""
This is a pipeline for ingesting multiple FASTA files (using BioPython's SeqIO),
extracting sequence IDs and their corresponding peptide strings, and consolidating them
into a unified CSV file. It uses Bodo to parallelize the process, which can
significantly expedite data processing, especially on larger datasets.
The end result is a single CSV containing all the annotation, probe, and source
information derived from each FASTA input.
"""
import glob
import os
import time

import pandas as pd
from Bio import SeqIO

import bodo

df_sample = pd.DataFrame(
    {"annotation": ["a"], "PROBE_SEQUENCE": ["p"], "source": ["s"]}
)
df_type = bodo.typeof(df_sample)


@bodo.wrap_python(df_type)
def fasta4epitope(fasta_file):
    # Save sequence names and sequences in a list
    seq_names = []
    sequences = []
    # Read fasta file containing epitope sequences
    for record in SeqIO.parse(fasta_file, "fasta"):
        seq_names.append(record.id)
        sequences.append(str(record.seq))
    # Save these in a data frame
    seq_df = pd.DataFrame({"annotation": seq_names, "PROBE_SEQUENCE": sequences})
    # Add a column with epitope name
    seq_df["source"] = os.path.basename(fasta_file).split(".")[0]
    # Change all columns to string
    seq_df = seq_df.astype(str)
    return seq_df


@bodo.jit(cache=True)
def process_all_fastas(fasta_files, out_path):
    t0 = time.time()
    combined_df = pd.DataFrame()
    # Load each fasta file into a data frame and combine them
    for i in bodo.prange(len(fasta_files)):
        combined_df = pd.concat(
            [combined_df, fasta4epitope(fasta_files[i])], ignore_index=True
        )
    # Save the combined data frame to a csv file
    combined_df.to_csv(out_path, index=False)
    print("Execution time: ", time.time() - t0)


if __name__ == "__main__":
    # Directory containing the fasta files
    directory = "data/"
    # Find all fasta files in the specified directory
    fasta_files = glob.glob(os.path.join(directory, "*.fasta"))
    out_path = os.path.join(directory, "combined_fasta_sequences.csv")
    # Process all fasta files and combine them into a single data frame
    process_all_fastas(fasta_files, out_path)
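The result is a single combined_fasta_sequences.csv with one row per sequence record across all inputs. Schematically, the output looks like this (the values below are illustrative placeholders, not real output):

annotation,PROBE_SEQUENCE,source
seq_0001,MKTAYIAKQRQISFVK,sample1
seq_0002,MSDNGPQNQRNAPRIT,sample2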
No mpiexec required: Bodo handles the parallelization internally.
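If you want to control the degree of parallelism explicitly, recent Bodo releases read the worker count from the BODO_NUM_WORKERS environment variable (an assumption worth checking against your installed version's docs), e.g.:

BODO_NUM_WORKERS=8 python process_all_fastas.py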
The process_all_fastas.py example isn't just about merging FASTA files; it's a blueprint for scaling Python in data-intensive bioinformatics tasks, combining Python's simplicity with high-performance computing principles. By leveraging Bodo's just-in-time parallelization, you can tackle massive datasets without rewriting your code in a lower-level language or adopting a distributed framework. Whether you're dealing with hundreds or thousands of files, Bodo helps you spend less time wrestling with performance and more time deriving meaningful insights from your sequence data.
Our focus at Bodo is to empower data scientists to do their jobs faster and better with Python, without running into performance bottlenecks. This example comes from the bioinformatics domain, but Bodo is applicable much more broadly, and we'll share more soon! Be sure to join our Slack community for updates.