Parallel FASTA Aggregation with Bodo: A Python Example for Scalable Bioinformatics

Date: February 20, 2025
Author: Rohit Krishnan

In modern bioinformatics, large-scale genomic data is the norm, requiring efficient ways to read, process, and combine massive collections of sequence files for meaningful analyses. Reading multiple large FASTA files—a routine step in genomics—can become a bottleneck when datasets scale. This post shows how to efficiently parse and combine FASTA sequences in parallel using Bodo, illustrating how quickly we can produce a single CSV of all sequences from many files.

Why is this a problem?

  • Python’s inherent limitations: Reading and processing large FASTA files in naive Python can be slow due to interpreter overhead, single-threaded loops, and high memory use.
  • Scale matters: Some datasets contain thousands of FASTA files, each with potentially hundreds of thousands of sequences. This can take hours with a naive approach.
  • Memory issues: Merging large sets of sequences can cause high memory usage if done improperly.

This is where Bodo—a high-performance Python framework—comes in. The example demonstrates how to turn a computationally intensive task into a parallelized, memory-efficient process. Let’s break down how it works and why it matters.

What Does This Code Do?

The example reads multiple FASTA files in parallel and extracts sequence IDs alongside their peptide (or nucleotide) strings. Specifically:

  1. It identifies all FASTA files in a folder.
  2. For each file, it uses BioPython’s SeqIO.parse to read every sequence record.
  3. It stores each record’s annotation (the ID), PROBE_SEQUENCE (the actual sequence string), and source (the file’s basename, minus extension) in a DataFrame.
  4. It concatenates all of these into a single CSV file, saving users from having to manually merge or iterate over large datasets.
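For comparison, a plain single-threaded version of these four steps might look roughly like the sketch below (a baseline under the same file layout as the full example later in this post, not Bodo code):

import glob
import os

import pandas as pd
from Bio import SeqIO

# Naive single-threaded baseline: parse each FASTA file in turn.
frames = []
for fasta_file in glob.glob(os.path.join("data/", "*.fasta")):
    records = list(SeqIO.parse(fasta_file, "fasta"))
    df = pd.DataFrame(
        {
            "annotation": [r.id for r in records],
            "PROBE_SEQUENCE": [str(r.seq) for r in records],
        }
    )
    # Tag rows with the source file's basename (extension stripped)
    df["source"] = os.path.basename(fasta_file).split(".")[0]
    frames.append(df)

pd.concat(frames, ignore_index=True).to_csv(
    "data/combined_fasta_sequences.csv", index=False
)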

How Does It Work?

From Python Loop to Parallel Execution

A standard Python implementation uses a for loop to iterate over the files one at a time. This works for small datasets but becomes impractical for gigabase-scale data. With Bodo, you add a @bodo.jit decorator to the function you want to parallelize and a @bodo.wrap_python decorator to the regular Python function it calls, and that’s it:

@bodo.wrap_python(df_type)
def fasta4epitope(fasta_file):
    # Parse a single FASTA file, returning ID/Sequence pairs
    ...
    
@bodo.jit(cache=True)
def process_all_fastas(fasta_files, out_path):
    # Loop over each FASTA file in parallel, call `fasta4epitope`,
    # concatenate results, then save to CSV.
    ...

The only differences here are the @bodo decorators: extremely minimal code changes.

Bodo’s approach:

  1. @bodo.wrap_python(df_type) wraps a standard Python function so it can be called within a Bodo-jitted context.
  2. @bodo.jit + bodo.prange splits the list of files across parallel workers.
  3. Each worker reads its assigned FASTA file(s) and returns a partial DataFrame.
  4. The partial results are then concatenated into one unified DataFrame.
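As a minimal sketch of this prange pattern, with a simple sum reduction standing in for the DataFrame concatenation, the loop below distributes iterations across workers and Bodo combines the per-worker partial results:

import bodo

@bodo.jit
def total_length(strings):
    # bodo.prange behaves like range, but Bodo splits the iterations
    # across workers; `total` is recognized as a sum reduction.
    total = 0
    for i in bodo.prange(len(strings)):
        total += len(strings[i])
    return total

print(total_length(["ACGT", "AC", "ACGTACGT"]))  # 14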

Why This Matters

  1. Performance without rewrites:
    • Bodo integrates smoothly with existing Python code, so you can parallelize FASTA processing without jumping to another language.
  2. Scalability on demand:
    • Works on a single machine (utilizing all cores) or distributed environments.
  3. Real-world impact:
    • Batch annotation: Quickly merge thousands of FASTA files into a single dataset for downstream analysis.
    • High-throughput screening: Large-scale experiments often generate huge numbers of FASTA files; parallelizing helps avoid I/O bottlenecks.

Installation and Execution

1. Install Bodo via pip

Bodo is pip-installable. On Linux/Mac:

pip install bodo

2. Run the Example

Execute the script like standard Python:

"""
This is a pipeline for ingesting multiple FASTA files (using BioPython's SeqIO),
extracting sequence IDs and their corresponding peptide strings, and consolidating them
into a unified CSV file. It uses Bodo to parallelize the process, which can
significantly expedite data processing, especially on larger datasets.
The end result is a single CSV containing all the annotation, probe, and source
information derived from each FASTA input.
"""

import glob
import os
import time

import pandas as pd
from Bio import SeqIO

import bodo

df_sample = pd.DataFrame(
    {"annotation": ["a"], "PROBE_SEQUENCE": ["p"], "source": ["s"]}
)
df_type = bodo.typeof(df_sample)


@bodo.wrap_python(df_type)
def fasta4epitope(fasta_file):
    # Save sequence names and sequences in a list
    seq_names = []
    sequences = []

    # Read fasta file containing epitope sequences
    for record in SeqIO.parse(fasta_file, "fasta"):
        seq_names.append(record.id)
        sequences.append(str(record.seq))

    # Save these in a data frame
    seq_df = pd.DataFrame({"annotation": seq_names, "PROBE_SEQUENCE": sequences})

    # Add a column with epitope name
    seq_df["source"] = os.path.basename(fasta_file).split(".")[0]

    # Change all columns to string
    seq_df = seq_df.astype(str)
    return seq_df


@bodo.jit(cache=True)
def process_all_fastas(fasta_files, out_path):
    t0 = time.time()

    combined_df = pd.DataFrame()
    # Load each fasta file into a data frame and combine them
    for i in bodo.prange(len(fasta_files)):
        combined_df = pd.concat(
            [combined_df, fasta4epitope(fasta_files[i])], ignore_index=True
        )

    # Save the combined data frame to a csv file
    combined_df.to_csv(out_path, index=False)
    print("Execution time: ", time.time() - t0)


if __name__ == "__main__":
    # Directory containing the fasta files
    directory = "data/"
    # Find all fasta files in the specified directory
    fasta_files = glob.glob(os.path.join(directory, "*.fasta"))
    out_path = os.path.join(directory, "combined_fasta_sequences.csv")

    # Process all fasta files and combine them into a single data frame
    process_all_fastas(fasta_files, out_path)

No mpiexec required: Bodo handles the parallelization internally.
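Depending on your Bodo version, you may also be able to cap the worker count with an environment variable; BODO_NUM_WORKERS is the knob in recent releases (treat the exact name as an assumption for your version):

# Assumption: BODO_NUM_WORKERS controls the worker count in recent Bodo releases
BODO_NUM_WORKERS=8 python process_all_fastas.py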

Conclusion

The process_all_fastas.py example isn’t just about merging FASTA files; it’s a blueprint for scaling Python in data-intensive bioinformatics tasks. By leveraging Bodo’s just-in-time parallelization, you can tackle massive datasets without rewriting your code in a lower-level language. Whether you’re dealing with hundreds or thousands of files, Bodo helps you spend less time wrestling with performance and more time deriving meaningful insights from your sequence data.

More broadly, the same pattern combines Python’s simplicity with high-performance computing principles, letting engineers tackle problems once restricted to low-level languages or distributed frameworks. For bioinformaticians, this means spending less time optimizing code and more time on biological insights.

Our focus at Bodo is to empower data scientists to do their jobs faster and better in Python, without adding bottlenecks. This example comes from the bioinformatics domain, but Bodo is applicable much more broadly, and we’ll share more soon! Be sure to join our Slack community for updates.

Ready to see Bodo in action?
Schedule a demo with a Bodo expert
