Data analytics and ML/AI applications are transforming many industries, such as healthcare, financial services, retail, and manufacturing, among others. However, developing and deploying these applications is highly challenging, even for the largest and most sophisticated companies. The fundamental problem is that data analytics has conflicting agility and performance requirements. Even after developing analytics workloads, achieving the required high-performance levels (only possible in HPC) is beyond the reach of most developers.
Bodo is a novel analytics engine technology that empowers data scientists on all compute platforms (e.g. public or private cloud) to run Python workloads with the extreme performance and scalability of HPC without code rewrites.
HPC vs. Enterprise Analytics
Application development has always been significantly different in HPC compared to mainstream enterprise analytics. HPC codes are usually written in low-level languages such as FORTRAN or C++ and explicitly specify parallelism, typically using MPI. These codes are extremely fast and can scale to massive supercomputers (the largest supercomputer in the world has over 7 million CPU cores), but require many highly trained HPC experts and often several years to develop. For example, a simple computation that needs exchange between neighboring processors looks like this in MPI/Fortran:
IF (irank > 0) THEN call MPI_RECV(iField,1, MPI_INTEGER, irank-1, tag1, comm, istatus, ierror) DO i=1,N FIELDO(i) = iField(1) + FIELDO(i) END DO END IF IF (irank < nproc-1) THEN iField(1) = FIELDO(N) call MPI_SEND(iField, 1, MPI_INTEGER, irank+1, tag1, comm, ierror) END IF
On the other hand, data scientists in enterprises use high-level scripting languages such as Python, which are easy to use. For example, summation across rolling windows of data looks like this in Python/Pandas:
The downside to using such high-level languages is that they are slow and sequential, sometimes as much as 100x slower than their potential optimized HPC version. Several computer science research projects have tried to bridge this productivity-performance gap, but did not even come close to HPC performance. Bodo is a new engine that offers Python simplicity and HPC scalability and performance, democratizing HPC for data analytics for the first time.
Bodo: The Elastic Python Cloud Engine
Bodo is an inferential compiler that enables large-scale, high efficiency, elastic, parallel processing of Python workloads for production deployment for all compute platforms. Bodo’s underlying automatic-parallelization and optimization technology is the culmination of several years of research (including breakthroughs at Intel Labs 1) that enables inferring the parallel structure of the application code for the first time.
Bodo allows the same code to run on laptops, cloud clusters, and edge devices efficiently. This new engine is not a library – e.g., it does not provide new pandas-like APIs in Python. In contrast, it compiles existing analytics codes using the familiar Just-in-time compilation workflow (similar to Numba), with the mere addition of a simple annotation.
Monte Carlo Pi Example
A Monte Carlo Pi calculation example is often used to demonstrate programming models of analytics engines. The Bodo version is standard Numpy code with a simple JIT annotation:
@bodo.jit def calc_pi(n): t1= time.time() x= 2* np.random.ranf(n) - 1 y= 2* np.random.ranf(n) - 1 pi= 4* np.sum(x ** 2+ y ** 2< 1) / n "Execution time:", time.time() - t1, "\nresult:", pi) returnpi
The execution workflow is the same as that of regular Python. The only difference is that Bodo compiles calc_pi and replaces it with a parallelized and optimized version automatically. This is as if an HPC expert rewrote the program in MPI/C++, except that it happens automatically and in real-time. Therefore, this code can now run efficiently on any number of processors without any rewrite.
In contrast, the Spark version of this example requires a complete rewrite of the code (and understanding Spark APIs) and has poor performance. Straightforward benchmarking of this example using an input size of 70 billion on a 4-node AWS cluster 2 shows 116X speedup of Bodo over Spark. Since the computation is fine-grained in this case, a task scheduling master/executor system like Spark spends most of the time in overheads instead of useful computation.
TPC-H Q3 Example
Processing tabular data (e.g., ETL) is at the core of many analytics workloads. Data scientists often write tabular data processing code in Pandas, but big data systems usually require SQL or DataFrame-like code. Bodo supports Pandas code without any change, while exceeding the performance of SQL systems significantly. For example, here is the Pandas code for the standard TPC-H Q3 benchmark:
@bodo.jit def q3(): date= "1995-03-04" lineitem= pd.read_parquet( "lineitem.pq") orders= pd.read_parquet( "orders.pq") customer= pd.read_parquet( "customer.pq")
flineitem= lineitem[lineitem.L_SHIPDATE > date] forders= orders[orders.O_ORDERDATE < date] fcustomer= customer[customer.C_MKTSEGMENT == "HOUSEHOLD"]
jn1= fcustomer.merge(forders, left_on= "C_CUSTKEY", right_on= "O_CUSTKEY") jn2= jn1.merge(flineitem, left_on= "O_ORDERKEY", right_on= "L_ORDERKEY") jn2[ "TMP"] = jn2.L_EXTENDEDPRICE * ( 1- jn2.L_DISCOUNT)
colnames= [ "L_ORDERKEY", "O_ORDERDATE", "O_SHIPPRIORITY"] total= jn2.groupby(colnames, as_index= False)[ "TMP"]. sum() total= total.sort_values([ "TMP"], ascending= False) res= total[[ "L_ORDERKEY", "TMP", "O_ORDERDATE", "O_SHIPPRIORITY"]]
Benchmarking this example on a 4-node AWS cluster3 using a 100GB dataset demonstrates 63x speedup over Spark, even though Spark is highly optimized for these use cases.
Hence, Bodo scales and accelerates Python analytics without rewriting code.
New horizons for Analytics
Aside from extreme HPC-like performance for python workloads, Bodo’s revolutionary technology provides orders of magnitude improvement along several other dimensions, which opens the door to new possibilities in the analytics/AI domain:
- Developer productivity: avoiding code rewrites and real-time execution times allow developers to build more complex applications in shorter time frames.
- Higher accuracy: real-time automatic scaling to larger datasets allows developers to iterate over models faster, improving model accuracy.
- Real-time Cloud and Edge applications: linear scaling and real-time performance enable applications that require smart real-time decision making, both in the Cloud and at the Edge.
- Infrastructure efficiency: lower cost provided by an efficient HPC architecture makes compute and data intensive applications much more economical.
Current Status and the Future Ahead
Bodo is in production use by several customers with great success, demonstrating unprecedented improvements in simplicity, speed, scalability, cost, and infrastructure efficiency compared to other engines. To provide access to the broader developer community, we are developing an easy-to-use and optimized Bodo Platform in the Cloud. Using intuitive web interfaces, the Bodo Platform automates all common infrastructure needs, including cluster creation, Jupyter notebooks, and job scheduling. This enables data scientists to focus on solving problems rather than DevOps and infrastructure setup. Contact us to join our Insider Program and gain early access to the Bodo Platform.
Bodo currently has comprehensive native optimization support for common analytics primitives, especially Pandas and Numpy APIs (see Bodo documentation), and we regularly expand and improve this support significantly on a monthly release schedule. We also provide support for more packages such as Scikit-learn and Scipy. Furthermore, any Python code can be integrated into Bodo code through existing mechanisms.
We are working with our customers and partners to make sure our roadmap meets the most pressing analytics/ML/AI needs, as we believe that Bodo technology has enormous potential to redefine all facets of analytics.
2: A cluster of 4 C5n.18xlarge instances, using AWS EMR (version 5.30.1, which was default at the time) for the Spark case and Bodo Platform (version 2020.08) for the Bodo case.
3: A cluster of 4 C5n.18xlarge instances, using AWS EMR (version 5.30.1) for the Spark case and Bodo Platform (version 2020.08) for the Bodo case, measuring compute-time only (we will discuss I/O in later blog posts).