
Creating a cycling training data pipeline with GCP and Airflow

Mar 4, 2024

This is an overview of a personal project called antren.

When it comes to personal projects, I like to take the shortest path to getting them done; it's usually about the destination, not the journey. This project was a bit different: I deliberately left each step un-optimized for the step that follows it, so that every step had something meaningfully beneficial to do (and gave me something to write about).

Here is the project in 3 simple steps:

Extract workouts from Garmin

Using Docker and Google Cloud Run, I deployed a beautifully slow Python script (it uses BeautifulSoup) that gets TCX files from Garmin, turns them into Parquet files, and uploads them to Google Cloud Storage.

Repo

Read More
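
Here's a minimal sketch of that step, assuming a TCX file has already been downloaded; the Garmin scraping and login are omitted, and the bucket and file names are placeholders rather than the project's real ones.

```python
# Sketch of the extract step: flatten one TCX file into Parquet and
# upload it to Google Cloud Storage. Assumes the TCX is already on
# disk; the Garmin scraping part is left out.
from pathlib import Path

import pandas as pd
from bs4 import BeautifulSoup
from google.cloud import storage


def tcx_to_parquet(tcx_path: str, parquet_path: str) -> None:
    """Parse a TCX export and write its trackpoints to a Parquet file."""
    soup = BeautifulSoup(Path(tcx_path).read_text(), "xml")  # needs lxml
    rows = []
    for tp in soup.find_all("Trackpoint"):
        hr = tp.find("HeartRateBpm")
        dist = tp.find("DistanceMeters")
        rows.append({
            "time": tp.find("Time").text if tp.find("Time") else None,
            "heart_rate": int(hr.find("Value").text) if hr else None,
            "distance_m": float(dist.text) if dist else None,
        })
    pd.DataFrame(rows).to_parquet(parquet_path, index=False)  # needs pyarrow


def upload_to_gcs(parquet_path: str, bucket: str, blob_name: str) -> None:
    """Push the Parquet file to a GCS bucket (placeholder names below)."""
    storage.Client().bucket(bucket).blob(blob_name).upload_from_filename(parquet_path)


if __name__ == "__main__":
    tcx_to_parquet("workout.tcx", "workout.parquet")
    upload_to_gcs("workout.parquet", "antren-workouts", "raw/workout.parquet")
```
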

Load Parquet files into BigQuery

With Airflow, I orchestrated the step above, along with loading these wonderfully compressed, but very-hard-to-analyze, Parquet files into BigQuery.

Repo

Read More
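
As a sketch of how the orchestration could be wired up, assuming Airflow 2.4+ with the Google provider installed: the extract callable stands in for the Cloud Run trigger, and the bucket, project, and table names are placeholders.

```python
# Sketch of the DAG: run the Garmin extract, then load the resulting
# Parquet files from GCS into BigQuery.
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator
from airflow.providers.google.cloud.transfers.gcs_to_bigquery import (
    GCSToBigQueryOperator,
)


def run_garmin_extract() -> None:
    """Placeholder: kick off the Cloud Run service that scrapes Garmin."""


with DAG(
    dag_id="antren_pipeline",
    start_date=datetime(2024, 1, 1),
    schedule="@daily",
    catchup=False,
) as dag:
    extract = PythonOperator(
        task_id="extract_garmin_workouts",
        python_callable=run_garmin_extract,
    )

    load = GCSToBigQueryOperator(
        task_id="load_parquet_to_bigquery",
        bucket="antren-workouts",
        source_objects=["raw/*.parquet"],
        destination_project_dataset_table="my-project.antren.workouts_raw",
        source_format="PARQUET",
        write_disposition="WRITE_APPEND",
    )

    extract >> load  # scrape first, then load the new files
```
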

Transform using dbt

With dbt, I turned nested, JSON-like columns into rows that feed the metrics I use to track my cycling training. Finally, I no longer have to pay for a subscription to get exactly the information I want about my progress.

Repo

Read More
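
The models themselves are dbt SQL (see the repo); purely to illustrate the unnesting idea, here is the same shape of transformation sketched in pandas, with made-up column names.

```python
# Not the actual dbt model: a pandas illustration of exploding a
# nested, JSON-like column into one row per sample.
import pandas as pd

# One workout whose samples live in a nested, JSON-like column.
workouts = pd.DataFrame({
    "workout_id": [1],
    "trackpoints": [[
        {"time": "2024-03-01T09:00:00Z", "watts": 180, "heart_rate": 130},
        {"time": "2024-03-01T09:00:01Z", "watts": 195, "heart_rate": 132},
    ]],
})

# Explode the nested column into one row per sample, then spread each
# dict into proper columns.
samples = workouts.explode("trackpoints", ignore_index=True)
samples = pd.concat(
    [samples.drop(columns="trackpoints"), pd.json_normalize(samples["trackpoints"])],
    axis=1,
)

# With flat rows, metrics are simple aggregations, e.g. average power.
print(samples.groupby("workout_id")["watts"].mean())
```
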
