Reading and Converting Data
While these files may be read directly from their source into dataframes, we prefer to avoid networking-related performance issues and work with the files locally instead. You can download the requisite files with a script like the following:
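A minimal sketch, assuming the dataset is a set of per-year CSV files hosted at a public URL; `BASE_URL`, the filename pattern, and the `download_year` helper are all hypothetical placeholders for wherever your copy of the data actually lives:

```python
import os
import urllib.request

# Hypothetical host; substitute the real location of the dataset.
BASE_URL = "https://example.com/tripdata"

def download_year(year, dest_dir="data"):
    """Download the CSV for the given year into dest_dir, skipping if present."""
    os.makedirs(dest_dir, exist_ok=True)
    fname = f"tripdata_{year}.csv"
    dest = os.path.join(dest_dir, fname)
    if not os.path.exists(dest):
        urllib.request.urlretrieve(f"{BASE_URL}/{fname}", dest)
    return dest

csv_path = download_year(2016)  # whichever year the walkthrough starts with
```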
The same can be done for the 2017 dataset, as shown here.
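Reusing the hypothetical `download_year` helper from the sketch above, only the year argument changes:

```python
csv_path_2017 = download_year(2017)
```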
In our ETL pipeline we want to run a piece of code that looks like this:
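A simplified sketch of such a function; the column names (`pickup_datetime`, `passenger_count`, `trip_distance`) and the aggregation are stand-ins for whatever your pipeline actually computes:

```python
import pandas as pd

def etl(path):
    # Read the raw CSV, keeping only the columns the pipeline needs.
    df = pd.read_csv(
        path,
        usecols=["pickup_datetime", "passenger_count", "trip_distance"],
        parse_dates=["pickup_datetime"],
    )
    # Example transformation: total distance travelled per month.
    df["month"] = df["pickup_datetime"].dt.month
    return df.groupby("month")["trip_distance"].sum()

result = etl("data/tripdata_2017.csv")
```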
Enter Bodo: we can run the exact same function with minimal changes and drop that roughly 40-second runtime down to 6 seconds on 8 cores. Let's see how.
If you're running your code in a Jupyter notebook, you'll want to launch an IPyParallel cluster first:
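A sketch assuming ipyparallel 8+ with MPI engines, one per core:

```python
import ipyparallel as ipp

# Launch 8 MPI engines locally, wait for them to come up, and
# activate the %px/%%px magics for this client.
rc = ipp.Cluster(engines="mpi", n=8).start_and_connect_sync(activate=True)
```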
Note: when running on the Bodo.ai platform, this step is unnecessary; it's already taken care of.
With a cluster available, we then run our cells with the %%px magic as below:
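The cell below is a sketch of that pattern: the same hypothetical ETL function as before, with the `bodo.jit` decorator as the only change:

```python
%%px
import bodo
import pandas as pd

@bodo.jit
def etl(path):
    # Bodo parallelizes the read and the groupby across all engines.
    df = pd.read_csv(
        path,
        usecols=["pickup_datetime", "passenger_count", "trip_distance"],
        parse_dates=["pickup_datetime"],
    )
    df["month"] = df["pickup_datetime"].dt.month
    return df.groupby("month")["trip_distance"].sum()

result = etl("data/tripdata_2017.csv")
```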
With the same imports, we now run the same script again, this time leaning on the benefits of Parquet:
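A sketch, again with the hypothetical names from above: a one-time conversion of the CSV to Parquet in a regular cell, then the same pipeline pointed at the Parquet file:

```python
# One-time conversion with plain pandas, run once outside %%px.
import pandas as pd

pd.read_csv("data/tripdata_2017.csv", parse_dates=["pickup_datetime"]).to_parquet(
    "data/tripdata_2017.pq"
)
```

```python
%%px
@bodo.jit
def etl_pq(path):
    # Parquet is columnar: only the columns used below are read from disk.
    df = pd.read_parquet(path)
    df["month"] = df["pickup_datetime"].dt.month
    return df.groupby("month")["trip_distance"].sum()

result = etl_pq("data/tripdata_2017.pq")
```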
On the same hardware, all of this now took less than 1.1 seconds. This is possible in part because Parquet uses columnar storage: while the CSV files must be read in their entirety, Parquet allows Bodo to efficiently extract only the subset of columns it needs.
Should you run the same code yet again, but read the segmented Parquet files instead, that roughly 1.1 seconds drops further to about 0.5 seconds. Not as exciting when you're just running it once on small files; but if you had a process that ran millions of times, on lots of data, that's a hefty chunk of compute cost saved.
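A sketch of that variant, with file names as hypothetical as before: writing the Parquet data from inside a jitted function on 8 cores produces a directory of part files, and reading that directory back is parallelized across the segments:

```python
%%px
@bodo.jit
def segment(pq_path, out_path):
    # Distributed write: out_path becomes a directory of part files,
    # one per core.
    df = pd.read_parquet(pq_path)
    df.to_parquet(out_path)

@bodo.jit
def etl_parts(path):
    # Each core reads its own slice of the part files in parallel.
    df = pd.read_parquet(path)
    df["month"] = df["pickup_datetime"].dt.month
    return df.groupby("month")["trip_distance"].sum()

segment("data/tripdata_2017.pq", "data/tripdata_2017_parts.pq")
result = etl_parts("data/tripdata_2017_parts.pq")
```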
From 40 seconds down to 0.5 seconds with little hassle
For a single file you'll also pay some up-front compilation time, but as the job grows, that compile time still happens only once and becomes negligible for large-scale work.
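If even that matters, Bodo supports Numba-style on-disk caching of the compiled function, so later runs can skip compilation entirely (assuming a Bodo version with cache support):

```python
@bodo.jit(cache=True)  # persist the compiled function across runs
def etl_pq(path):
    ...
```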
The issue is that Bodo and Python do not always represent data structures in the same way: Bodo uses efficient native (machine-level) data structures rather than Python objects. This means that passing data from the top-level Python namespace into a Bodo-jitted function involves 'unboxing' that object from its Python representation. Similarly, returning an object from a Bodo-jitted function back to the top-level Python namespace requires 'boxing' it back into a Python object.
Given pseudo-code:
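(A sketch of the hidden steps; `unbox`, `compiled_func`, and `box` are illustrative names, not real Bodo APIs.)

```python
# What a call into a jitted function effectively does:
result = bodo_func(python_obj)

# is, behind the scenes:
native_obj = unbox(python_obj)          # Python object -> native representation
native_out = compiled_func(native_obj)  # the actual compiled work
result = box(native_out)                # native representation -> Python object
```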
This simply means that rather than running each stage as a separate jitted function, paying the unboxing and boxing toll at every boundary, we keep the whole pipeline inside a single jitted function so that intermediate dataframes never cross back into Python:
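A sketch of the fused version, reusing the hypothetical pipeline from above; only the small final aggregate is boxed on return:

```python
%%px
@bodo.jit
def pipeline(path):
    # Intermediate dataframes stay in Bodo's native representation
    # from read to aggregation; nothing is boxed until the return.
    df = pd.read_parquet(path)
    df["month"] = df["pickup_datetime"].dt.month
    return df.groupby("month")["trip_distance"].sum()

result = pipeline("data/tripdata_2017_parts.pq")
```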
If you've made it this far and are still interested in knowing more, we suggest looking at the final three notebooks in this repo. We cover a few final details about scaling these ideas for production and how to generalize a pipeline to work with many different data sources.