Amazon S3 is one of the most popular technologies that data engineers use to store data as a data lake. A typical task is reading compressed parquet files as part of the extract step in an ETL (extract-transform-load) pipeline. Bodo’s true parallel approach to computing can cut your read time roughly in proportion to the number of CPU cores on your server. Here is a short example comparing Python with and without Bodo:
To read data from S3, use the code snippet below. s3fs and pyarrow are the file system and parquet libraries commonly used for this purpose. In this miniblog, we read ~230 MB of parquet files, partitioned into 10 parts, from Bodo’s public S3 bucket.
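A minimal sketch of the pandas version is below; the bucket path is a placeholder (the exact dataset location isn’t given here), and anonymous access to the public bucket is assumed:

```python
import time

import pandas as pd


def read_parquet_pandas():
    """Download and read the partitioned parquet dataset with plain (single-core) pandas."""
    t0 = time.time()
    # pandas delegates to s3fs (the S3 filesystem layer) and pyarrow (the parquet engine).
    # "s3://bodo-example-data/transactions.pq" is a placeholder path for the public bucket.
    df = pd.read_parquet(
        "s3://bodo-example-data/transactions.pq",
        storage_options={"anon": True},  # anonymous access to a public bucket
    )
    print("pandas read time:", time.time() - t0, "seconds")
    return df


if __name__ == "__main__":
    df = read_parquet_pandas()
    print(df.shape)
```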
This code takes 71 seconds to download and read the data on a local MacBook with a 700 Mbps internet connection.
Now let’s see how Bodo performs using only 4 CPU cores.
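A sketch of the same read with Bodo follows; the only change is the `@bodo.jit` decorator, and the script is launched with MPI so the 10 parquet pieces are split across the 4 cores (the path is the same placeholder as above):

```python
import time

import pandas as pd
import bodo


@bodo.jit
def read_parquet_bodo():
    # Bodo compiles this function and distributes the parquet pieces across MPI ranks,
    # so each core downloads and parses only its share of the data.
    t0 = time.time()
    df = pd.read_parquet("s3://bodo-example-data/transactions.pq")  # placeholder path
    print("Bodo read time:", time.time() - t0, "seconds")
    print(df.shape)


if __name__ == "__main__":
    read_parquet_bodo()
```

Launch it on 4 cores with `mpiexec -n 4 python read_s3_bodo.py`.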
Bodo takes 22 seconds, around a quarter of the pandas time, because the read is split across 4 cores.
With that simple change, you can reduce your compute expense by roughly 4x. This is a small dataset that you can try on a laptop; now imagine how much you can save when dealing with terabytes (if not petabytes) of data!
You can use the Bodo Platform to spin up large EC2 instances for larger data. The platform creates Bodo clusters automatically, so you can read gigantic datasets in a matter of seconds. In this example, I read 1 GB of parquet data in 2.6 seconds with Bodo on a cluster of two c5n.9xlarge instances; running the same read without Bodo took 59 seconds.
With Bodo: 2.58 sec
Without Bodo: 59.42 sec
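The code itself does not change when moving from a laptop to a cluster; only the dataset and the number of MPI ranks grow. A sketch under the same assumptions (placeholder path, cluster launched via the platform):

```python
import pandas as pd
import bodo


@bodo.jit
def read_large_parquet():
    # The same jitted read, now pointed at the ~1 GB dataset; Bodo spreads the work
    # across every core that mpiexec launches a rank on, across both instances.
    df = pd.read_parquet("s3://bodo-example-data/transactions_1gb.pq")  # placeholder path
    print(df.shape)


if __name__ == "__main__":
    read_large_parquet()
```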
Let’s Bodo!