Overcoming the Hidden Debt of Big-Data Analytics

August 24, 2021

Behzad Nasre

Both the tech and traditional industries are waking up to the rise of big data analytics and the ever-growing abundance of data. Businesses know that using large-scale analytics, Machine Learning (ML), and Artificial Intelligence (AI) across terabytes and petabytes of data yields smarter operational decisions, increased customer transaction sizes, and more efficient automation.

However, there’s also the harsh reality of “technical debt”: as Moore’s Law runs up against the limits of hardware and physics, these analytics workloads bring ever-increasing computing cost, time, and energy consumption. On top of that, data engineers and data scientists must keep learning new languages and technologies just to go faster.

What if we could improve analytics performance by 1,000x, cut learning curves to one-tenth, and reduce aggregate operational expenses to one-tenth -- using the programming techniques and hardware already in use today? With our Series A funding, that’s what we at Bodo are committed to doing. Broad access to inexpensive, near-real-time, large-scale analytics will enable new data-centric revenue opportunities, faster competitive responses, and radically more efficient overall data operations. Gartner Research recently predicted that by the end of 2024, 75% of enterprises will shift from piloting to operationalizing AI, driving a 5x increase in streaming data and analytics infrastructures [1]. Clearly, a plethora of new business possibilities is now within reach, especially for developers and data scientists who didn’t have access to this power before.

How We Got Here

In the past decade, we have seen a rapid shift from building private data centers to using cloud computing platforms to reduce maintenance costs and enable elastic horizontal scaling. We have also seen rapid development of data and computation scaling technologies, from Hadoop MapReduce and Apache Spark to a host of scaling solutions that have emerged in the past few years. However, while many of these solutions claim 100x or 1,000x performance improvements, the reality is that they all carry untold burdens and compromises -- the “technical debt”:

They fail to take full advantage of computing power, often resulting in multiple idle wait-states.
Multiple frameworks and libraries force developers into ever-increasing learning curves.
Data scientists who prototype in Python must hand off their work to other development teams and infrastructure specialists to pilot and operationalize concepts for production.

Bodo’s vision is to develop computing solutions that do not compromise on cost, speed, scale, or simplicity -- and we do so by enabling supercomputing-style MPI parallel performance in native Python.

Unlocking Competitive Value

It is no surprise that modern big-data analytics and machine learning are expensive, and the cost is growing rapidly as the abundance of data skyrockets. For example, the cost of staffing a data team to scale business analytics and data pipelines for production can easily run into the millions of dollars [2], as many critical roles need to be filled [3]. Unfortunately, this high entry point, combined with questionable ROI, makes many businesses hesitant to explore new opportunities and find new revenue using big-data analytics and machine learning.

In the world of big data, speed is money. Big-data analytics already has a visible impact at most top-500 companies around the world, and often a slight increase in pipeline efficiency leads to significant business gains. For example, in 2006 Netflix offered a $1M prize to any team that could beat the company’s recommendation algorithm by 10% [4], and it awarded that prize in 2009. That recommendation engine is now worth over $1 billion [5]. Recent studies by Grand View Research Inc. show that the big data market is expected to reach USD 123 billion by 2025 [6] with current technologies, but entry to this market is effectively limited to the largest companies and institutions. What if there were a paradigm-shifting new technology that gave the power of supercomputers to everyone?

In 2018, we embarked on a quest to apply big-data analytics elegantly, without hidden debts (more on this below), by creating Bodo.ai. Since then, we have significantly lowered the bar of entry for accessing supercomputing-like power. We have made it possible to eliminate weeks-long machine learning development-to-production cycles and to scale a machine learning model on the same day it is developed, without a code rewrite. Imagine how much more data teams could do if they didn’t have to wait for data processing before working on the business problem. What if teams could get business insight from billions of customer entries daily, instead of from monthly or quarterly reports?

Finally, existing big-data analytics are also environmentally taxing, consuming thousands of physical compute-hours. What if corporate responsibility teams could market their businesses as energy-conscious or carbon-conscious? And with Bodo’s speed and efficiency, what business opportunities would open up if machine learning models could provide real-time inferences?

Making The Hidden Technical Debt Visible

Without a doubt, the data science industry loves Python for its simplicity as a prototyping language and hates it when it won’t scale for production. The frustration grows when data models must be rewritten in other languages such as C++ or Java/Scala for performance and reliability, at the expense of steep learning curves.

While existing frameworks and APIs for scaling, such as Spark and Dask, aim to address some of these issues, they compromise by requiring extra “glue code” to work. Glue code is inefficient and contributes to the high failure rates that are common with production jobs [7]. On the one hand, there are the apparent costs to business operations: hiring data engineers to provision resources and databases on big-data platforms, and hiring machine learning engineers to translate prototype models into production. Complicated data pipelines also lead to direct cloud infrastructure costs. On the other hand, there are hidden costs to the productivity of the data team, which are often overlooked but can be significant to business budgets when aggregated.

Context Switching: A study by Psychology Today shows that context switching can lead to a 40% decrease in productivity [8], costing the global economy 450 billion dollars annually [9]. Moving from a development environment to a production environment requires investing developer time and hiring new talent.
Skill Sets: Most new data science hires out of college know only Python or SQL. That makes it hard to find the right talent at an affordable salary, and it takes costly labor and resources to retrain hires to tune Spark or to work in other “production” languages at a professional level. A 2020 Stack Overflow survey of 65,000 developers shows that Scala developers held the highest salaries, especially in the US [10].
Glue Code: Lastly, using “glue code” within the pipeline at large scales is complex and inefficient, and it can contribute to additional maintenance and computation costs. Sculley et al. 2016 from Google describe how as much as 95% of the code in a mature machine learning system can be glue code [11], making it the dominant contribution to development cost. Glue code is also the weakest link in the pipeline, prone to failure and adding to the maintenance burden [12]. While glue code is sometimes a necessary evil, if it is not managed properly it becomes excessively complicated spaghetti code that negatively affects cost and performance [13].

How to Overcome These Debts?

At Bodo, it is our mission to enable Python for mission-critical data analytics production tasks. We are thrilled to announce that we have created an extreme-performance parallel compute platform for data analytics and ML that can showcase linear scaling -- beyond terabytes of data and tens of thousands of cores -- with exceptional efficiency.

Recently we have been working with a Fortune 10 company to demonstrate our capabilities in workflows that require handling billions of entries daily. They’ve found Bodo to be consistently over 10x faster than highly optimized Spark. For more detail, see our recent blog post on benchmarking using TPCx-BB (Big Bench) Query 26, a comprehensive benchmark that can represent data engineering workloads for various applications, including machine learning pipelines.

At the core, Bodo uses the same techniques as the best supercomputers to scale far beyond 10,000 cores, but without the need to write low-level code. By building the first inferential compiler for automatic parallelization, we have democratized the capability to solve some of the world’s most challenging data analysis problems using a fraction of the time, complexity, and cost of existing solutions, all while leveraging the simplicity and flexibility of native Python. Developers can deploy Bodo on any infrastructure, from laptops to public clouds such as AWS and Azure, using any commercially available connectors.
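To give a feel for the supercomputing-style model mentioned above, here is a minimal sketch of the general SPMD (single program, multiple data) pattern used with MPI: every "rank" runs the same function on its own block of the data, and a reduction combines the partial results. This is an illustration only, not Bodo's implementation or API. It runs in a single process using the Python standard library, and all of the names (`block_bounds`, `local_partial`, `allreduce_sum`, `parallel_mean`) are hypothetical.

```python
from typing import List, Sequence, Tuple


def block_bounds(n_items: int, n_ranks: int, rank: int) -> Tuple[int, int]:
    """Return the [start, end) range owned by `rank` under a 1D block distribution."""
    base, extra = divmod(n_items, n_ranks)
    # The first `extra` ranks each take one additional item.
    start = rank * base + min(rank, extra)
    end = start + base + (1 if rank < extra else 0)
    return start, end


def local_partial(data: Sequence[float], n_ranks: int, rank: int) -> Tuple[float, int]:
    """Work done by one rank: sum and count over its own block only."""
    start, end = block_bounds(len(data), n_ranks, rank)
    chunk = data[start:end]
    return sum(chunk), len(chunk)


def allreduce_sum(partials: List[Tuple[float, int]]) -> Tuple[float, int]:
    """Stand-in for an MPI-style sum reduction across all ranks."""
    total = sum(p[0] for p in partials)
    count = sum(p[1] for p in partials)
    return total, count


def parallel_mean(data: Sequence[float], n_ranks: int) -> float:
    """Simulate the ranks one after another; a real system would run them concurrently."""
    partials = [local_partial(data, n_ranks, r) for r in range(n_ranks)]
    total, count = allreduce_sum(partials)
    return total / count  # assumes `data` is non-empty


if __name__ == "__main__":
    values = list(range(1, 11))
    print(parallel_mean(values, n_ranks=4))
```

In a real deployment each rank would be a separate process, possibly on a different machine, and the reduction would be a collective communication step. The value of an automatically parallelizing compiler is that the developer writes only the ordinary sequential version of such a computation.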

Now, with easy access to supercomputer-like power, what could you do if you sped up your analysis by 100x or more at a tenth of the cost? We welcome you to check us out at Bodo.ai and contact us via Slack or Discourse if you have any questions or want to share your use case! #LetsBodo