One of the key values Bodo brings to our customers, especially those struggling with daily failures despite using expensive Spark cloud services, is robustness: Bodo production deployments can run without failure for months. Along with robustness, data teams also require their data pipelines to be resilient to failures.
In the data pipeline world, there are widespread misconceptions about resilience, such as “Spark is fault-tolerant, but MPI is not” (Bodo uses MPI) or “Spark’s fault tolerance makes its overhead worthwhile”. This blog post explains the disconnect between these misconceptions and practical reality by investigating various aspects of resilience, starting from basic principles. We then use realistic scenarios and concrete examples to demonstrate that Bodo applications are much more resilient than Spark applications in practice.
Bodo runs on bare metal and does not incur the runtime failures Spark does due to its runtime library complexity and Java Virtual Machine (JVM) challenges. In addition, Bodo’s speedup shortens application runtime significantly, reducing the likelihood of hardware failure during each run (typically hours at most) while increasing the slack time available to react to failures. Therefore, failures are very rare, and a simple restart setup is enough for almost all analytics applications. The main exception is large-scale deep learning training, which may take weeks. In this case, using the checkpoint/restart features in TensorFlow or PyTorch is recommended to avoid potential loss of computation in case of hardware infrastructure failure.
Achieving resilience requires a clear understanding of what it means in this context. Data analytics applications are fundamentally different from services deployed over distributed systems, such as web portals. A web portal must always be available despite failures (continuous availability), even if it is in a degraded state (e.g., a bit slower to respond). Hence, every portal request is answered correctly despite the failure of a limited number of components, and there is no downtime.
In contrast, an analytics application only requires a certain amount of computation to be finished as fast as possible even in the presence of failures. In this context, availability is not meaningful and the cluster can stop computing temporarily for recovery purposes, as long as the computation is done in a timely manner. In this sense, analytics applications are similar to High-Performance Computing (HPC) applications, and the fault tolerance techniques developed by the HPC community over the past decades apply to analytics as well.
There are two primary sources of random failure for analytics applications in production: 1) software infrastructure, 2) hardware infrastructure. Examples of software failures include Spark task failures and JVM errors. A hardware failure could be a server fan or a power supply failing. Ideally, the first category of failures should be minimized by avoiding unnecessary software layers which introduce potential failure points.
The most common and practical approach for resilience in parallel computing is checkpoint/restart. Typically, the programmer is responsible for checkpointing at optimal intervals (discussed below), and the platform is responsible for providing automatic error detection and recovery mechanisms.
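To make this division of labor concrete, here is a minimal application-level checkpoint/restart sketch in Python. It is illustrative only: the file path, interval, batch count, and `process_batch` stand-in are hypothetical and not part of any Bodo or Spark API.

```python
import os
import pickle
import time

CHECKPOINT_FILE = "state.pkl"        # hypothetical path; production jobs would use durable storage such as S3
CHECKPOINT_INTERVAL = 12 * 3600      # seconds between checkpoints (see the Young-Daly discussion below)
NUM_BATCHES = 1000

def process_batch(i, acc):
    """Stand-in for the real per-batch computation."""
    return (acc or 0) + i

def load_checkpoint():
    """Resume from the last saved state if one exists, otherwise start fresh."""
    if os.path.exists(CHECKPOINT_FILE):
        with open(CHECKPOINT_FILE, "rb") as f:
            return pickle.load(f)
    return {"next_batch": 0, "partial_result": None}

def save_checkpoint(state):
    """Persist the minimal state needed to restart the computation."""
    with open(CHECKPOINT_FILE, "wb") as f:
        pickle.dump(state, f)

state = load_checkpoint()
last_save = time.time()
for batch in range(state["next_batch"], NUM_BATCHES):
    state["partial_result"] = process_batch(batch, state["partial_result"])
    state["next_batch"] = batch + 1
    if time.time() - last_save > CHECKPOINT_INTERVAL:   # programmer-chosen interval
        save_checkpoint(state)
        last_save = time.time()
```

The platform’s role (Kubernetes, a scheduler, or a simple wrapper script) is then just to detect the failure and rerun the same program, which picks up from the last saved state.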
Resilience in Spark seems complex but does not fundamentally change the resilience properties of analytics applications. Spark’s distributed data structures, Resilient Distributed Datasets (RDDs) and DataFrames, enable its key resilience features. Spark keeps track of transformations on RDDs/DataFrames in lineage (task) graphs. When a node fails, Spark’s driver resumes execution by recomputing the lost outputs from the task graph.
Spark’s resilience model is based on distributed systems principles such as idempotent tasks. On the surface, this may seem important, since it is best to restart only the lost computation of the failed node. However, analytics applications have communication dependencies across processors (e.g., shuffle for groupby/join), and the driver will in most cases end up restarting the whole computation from scratch (or from the last checkpoint, if any). In addition, the programmer is responsible for manually checkpointing RDDs (cache/persist calls). Hence, Spark’s resilience model is actually a checkpoint/restart model in practice!
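For illustration, manual checkpointing in a Spark program typically looks like the sketch below, using standard PySpark calls; the paths and column names are hypothetical.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("manual-checkpoint").getOrCreate()

# Checkpoints go to durable storage chosen by the programmer (hypothetical path).
spark.sparkContext.setCheckpointDir("s3://my-bucket/checkpoints")

trips = spark.read.parquet("s3://my-bucket/taxi-trips")          # hypothetical input
daily = trips.groupBy("pickup_date").agg(F.sum("fare").alias("total_fare"))

# Keep the intermediate result around for reuse ...
daily.persist()
# ... and truncate the lineage graph by materializing it to the checkpoint directory.
daily = daily.checkpoint()

daily.write.parquet("s3://my-bucket/daily-fares")
```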
Bodo Engine is based on MPI and has no fundamental limitations in supporting fault tolerance. Bodo runs on bare metal, which avoids the failures of Spark’s heavy runtime system. For example, Bodo’s general parallel architecture enables shuffling with all-to-all messages, while Spark’s map/reduce architecture rules out all-to-all messages and forces it to use shuffle files instead, leading to unnecessary “too many open files” errors. In addition, Spark runs on the Java Virtual Machine (JVM), which can cause unexpected failures. Another fact that fundamentally changes the tradeoffs for resilience is that Bodo is orders of magnitude faster than Spark. We discuss these tradeoffs below.
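The following toy example (not Bodo’s internal implementation) shows what shuffle-by-all-to-all looks like with plain MPI, using the mpi4py bindings: each rank hash-partitions its local rows and exchanges them in a single collective call.

```python
# Run with, e.g., mpiexec -n 4 python shuffle_toy.py
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()

local_rows = [(key, rank) for key in range(rank * 4, rank * 4 + 4)]  # toy local data

# Bucket rows by destination rank (hash partitioning on the key).
send_buckets = [[] for _ in range(size)]
for key, value in local_rows:
    send_buckets[hash(key) % size].append((key, value))

# One collective call moves every bucket to its owner -- no shuffle files involved.
recv_buckets = comm.alltoall(send_buckets)
shuffled_rows = [row for bucket in recv_buckets for row in bucket]
print(rank, shuffled_rows)
```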
Bodo can take advantage of resilience efforts by the HPC community over the past decades. Unlike Spark, Bodo Engine uses a “clean engine” approach and does not bundle any middleware components. This allows easier integration with modern middleware such as Kubernetes, which provides extensive resilience support such as automatic recovery (see the example below). We are also working on compiler-assisted checkpointing, where, for instance, the compiler identifies all the variables that need to be checkpointed. This will simplify the checkpointing step as well.
A key question in checkpoint/restart is how often an application needs to checkpoint, and whether it should checkpoint at all. Theoretically, the Young-Daly formula provides an answer [2]. The optimal checkpoint interval (W) depends on the Mean Time Between Failures, or MTBF (M), of the nodes and the checkpointing time (C):

W = √(2 × M × C)
For example, if a cluster’s MTBF is 6 months, and it takes 1 minute to checkpoint to permanent storage (e.g. S3), then the application needs to checkpoint every 12 hours.
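The 12-hour figure follows directly from the formula above; the quick calculation below (assuming 30-day months) reproduces it.

```python
import math

mtbf_minutes = 6 * 30 * 24 * 60       # 6-month MTBF, assuming 30-day months
checkpoint_minutes = 1                 # time to write a checkpoint to S3

interval_minutes = math.sqrt(2 * mtbf_minutes * checkpoint_minutes)
print(interval_minutes / 60)           # 12.0 -> checkpoint roughly every 12 hours
```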
As demonstrated, the failure rate of a cluster (relative to execution time) is an essential factor in determining an optimal resilience strategy. Since Bodo runs on bare metal, the hardware failure rate is the most relevant factor. For a physical server, the MTBF is often measured in years, e.g., 10 years [2]. For clusters, the MTBF decreases in inverse proportion to the number of nodes. For example, a typical 16-node cluster (e.g., 768 physical cores with AWS c5.24xlarge instances) has an MTBF of about 7.5 months. Bodo accelerates application execution by orders of magnitude compared to Spark, so application execution time is often much shorter than the hardware MTBF, and many typical analytics workloads may not need to checkpoint at all.
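A back-of-the-envelope version of this reasoning, using the 10-year node MTBF and 16-node cluster assumed above and an illustrative one-hour job, looks like this:

```python
node_mtbf_months = 10 * 12             # ~10-year MTBF per physical server [2]
num_nodes = 16

cluster_mtbf_months = node_mtbf_months / num_nodes
print(cluster_mtbf_months)             # 7.5 months

# Rough chance of a failure interrupting a single run (valid when runtime << MTBF).
job_hours = 1                          # illustrative job length on Bodo
cluster_mtbf_hours = cluster_mtbf_months * 30 * 24
print(round(job_hours / cluster_mtbf_hours, 5))   # ~0.00019 -> a simple restart policy suffices
```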
Failures in cloud system software such as the AWS Nitro System can affect MTBF as well. The whole infrastructure could be compromised due to failures that may not be entirely random, such as software updates causing unexpected issues. Therefore, infrastructure redundancy such as replicating clusters in different regions could be necessary for highly critical applications. This applies to data center failures such as natural disasters as well.
As mentioned, Bodo’s bare metal execution enables higher resilience with less effort. Spark’s complex runtime, on the other hand, can cause random software failures, sometimes on a daily basis. Therefore, while a Bodo application may not need to checkpoint at all, the Spark version of the same program can require frequent checkpoints. Typically, only long-running iterative machine learning applications like deep learning training have execution times much longer than the cluster’s MTBF. Fortunately, these applications are easy to checkpoint since only the model needs to be saved across iterations.
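For such training jobs, a checkpoint only needs to capture the model (and optimizer) state at the end of each epoch. A minimal sketch using PyTorch’s standard torch.save/torch.load calls, with hypothetical model and path names:

```python
import os
import torch

CKPT_PATH = "model_checkpoint.pt"     # hypothetical path; typically durable storage

model = torch.nn.Linear(128, 10)      # stand-in for the real model
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
start_epoch = 0

# Resume from the last completed epoch if a checkpoint exists.
if os.path.exists(CKPT_PATH):
    ckpt = torch.load(CKPT_PATH)
    model.load_state_dict(ckpt["model"])
    optimizer.load_state_dict(ckpt["optimizer"])
    start_epoch = ckpt["epoch"] + 1

for epoch in range(start_epoch, 100):
    # ... one epoch of training would go here ...
    torch.save(
        {"epoch": epoch, "model": model.state_dict(), "optimizer": optimizer.state_dict()},
        CKPT_PATH,
    )
```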
Cloud services often provide preemptible instances at a discount. For example, the AWS Spot market provides instances based on a bidding price, but an instance may be taken away if the market price rises above the bid price. Therefore, the bid price can determine the MTBF of the cluster. In these cases, the discount presents a resilience tradeoff that needs to be considered carefully [3]. The cost of extra restarts should be lower than the discount for Spot instances to be economical. In general, Bodo has no fundamental limitations for using Spot instances.
Bodo’s speed is a major economic factor in decision-making for analytics infrastructure. For example, if the discount is 80% (the price is 5x lower) but Spark’s overheads make the application 30x slower than Bodo, then using Bodo with on-demand instances is still cheaper than Spark on Spot instances (30x the runtime at one-fifth the price is still 6x the cost). Therefore, while Bodo supports Spot instances, it can enable simpler on-demand infrastructure with better economics and without Spot complexity for many applications.
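The arithmetic behind that comparison, with the illustrative 80% discount and 30x slowdown figures above:

```python
on_demand_rate = 1.0                       # normalized hourly on-demand price
spot_rate = 0.20 * on_demand_rate          # 80% discount -> 5x cheaper per hour

bodo_runtime_hours = 1.0                   # normalized Bodo runtime
spark_runtime_hours = 30.0                 # illustrative 30x slowdown on Spark

bodo_on_demand_cost = bodo_runtime_hours * on_demand_rate   # 1.0
spark_spot_cost = spark_runtime_hours * spot_rate           # 6.0, before any restart costs

print(spark_spot_cost / bodo_on_demand_cost)                # 6.0x more expensive
```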
We use an example workload with taxi benchmark queries to demonstrate the resilience difference between Bodo (with Kubernetes for automatic restart) and Spark. We assume no user-provided checkpointing (i.e., full restarts) for simplicity, but a similar analysis applies in the checkpointing case as well.
Bodo (version 2022.2) takes 58 seconds to execute, while Spark on AWS EMR (Spark version 2.4.8) takes 359 seconds. Bodo is ~6x faster here, a smaller speedup than in real applications since this workload is minimal and I/O makes up a large portion of the execution time. Nevertheless, this speedup has significant implications for resilience.
To derive a concrete metric for resilience, we assume that Bodo and Spark have the same MTBF (ignoring Spark’s runtime failures) and that the Service Level Objectives (SLOs) require this workload to run in 15 minutes at most. In this case, the Bodo solution can support 15 restarts (floor(900s/58s)=15), while the Spark solution can support only 2 (floor(900s/359s)=2). Therefore, the Bodo solution is 7.5x more resilient than the Spark solution based on the number of possible restarts. This demonstrates that the performance of Bodo has a significant impact on resilience as well.
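The restart budget is simply the SLO divided by a single run’s execution time:

```python
slo_seconds = 15 * 60                  # 15-minute SLO
bodo_seconds, spark_seconds = 58, 359  # measured single-run execution times

bodo_restarts = slo_seconds // bodo_seconds    # 15
spark_restarts = slo_seconds // spark_seconds  # 2
print(bodo_restarts, spark_restarts, bodo_restarts / spark_restarts)  # 15 2 7.5
```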
Overheads of master-executor systems like Spark have been justified as a “necessary evil” for achieving resilience. However, we have shown that Bodo can achieve much higher resilience without extra overheads. Unlike Spark, Bodo has a “clean engine” design that enables easy integration with Kubernetes for reliable recovery. In addition, Bodo’s bare metal execution avoids software failures like Spark driver or JVM errors. Essentially, Bodo avoids a complex task scheduling system, reducing the points of potential failure substantially. Moreover, Bodo’s extreme performance fundamentally changes the tradeoffs for resilience. We plan to simplify the checkpointing process using compiler techniques and provide all the necessary resilience features on our cloud platform as well.
Visit our website, https://bodo.ai, to learn more about Bodo products and see Bodo in action.
[1] Fault Tolerance in MPI Programs
[2] Scalable Message-Logging Techniques for Effective Fault Tolerance in HPC Applications
[3] Bidding for Highly Available Services with Low Price in Spot Instance Market