This is part 1 of a personal project called antren.
I have a python script hosted as a Google Cloud Run Job that grabs my last day of bike rides via the Garmin API and converts the original tcx files (xml) to parquet. The parquet files are stored in Google Cloud Storage.
Repo: alhankeser/antren-app. It's called "app" because it plays the role of what an app would do: generate data that I later transform and analyze.
I started by looking for libraries to interact with the Garmin API. There isn't a clear way for non-business users to get access to the API, but I found this library. It helped me understand how I could get this done. I didn't end up using the library because it was more than I needed, but I did borrow a function from it for looping through pages of activities.
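The borrowed pagination logic boils down to requesting fixed-size pages of activities until an empty page comes back. A minimal sketch of the pattern, with fetch_page standing in as a hypothetical wrapper around the Garmin activities endpoint (not the library's real function name):

def get_all_activities(fetch_page, page_size=20):
    # Request activities page by page until an empty page signals the end
    activities = []
    start = 0
    while True:
        page = fetch_page(start=start, limit=page_size)
        if not page:
            break
        activities.extend(page)
        start += page_size
    return activities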
The main function is fairly easy to understand without comments, but I'll add some anyway:
def main():
    # Get an authenticated Garmin API client
    # If the token has expired, a new one will be fetched
    api_client = garmin.authenticate("garmin.com")

    # Get the last day of activities
    activities = garmin.get_last_day_of_activities(api_client)
    logging.info(f"Got {len(activities)} activities")

    # Loop through activities
    for activity_data in activities:
        activity_id = activity_data["activityId"]
        activity_name = activity_data["activityName"]

        # Get the tcx xml file from Garmin and save it locally
        activity_raw_data = garmin.get_activity_raw_data(
            api_client, activity_id, format="tcx"
        )
        raw_file_path = file_manager.save_activity_raw_file(
            activity_id, activity_raw_data
        )

        # Convert the file to parquet, keeping the fields that I care about
        # Save it locally
        converted_file_path = file_manager.convert_activity_file(
            activity_id, raw_file_path, format="parquet"
        )

        # Delete the original file from local
        os.remove(raw_file_path)

        # Upload the converted file to cloud
        file_manager.upload_to_cloud(
            converted_file_path, converted_file_path.split("/")[-1]
        )

        # Delete the converted file from local
        os.remove(converted_file_path)
        logging.info(f"Successfully saved: {activity_name} ({activity_id})")
The most time-consuming part of this was correctly parsing the tcx file. Here's a sample of what a tcx file looks like:
<Activities>
  <Activity Sport="Biking">
    <Id>2024-01-18T14:29:07.000Z</Id>
    <Lap StartTime="2024-01-18T14:29:07.000Z">
      <Track>
        <Trackpoint>
          <Time>2024-01-18T14:29:08.000Z</Time>
          <AltitudeMeters>13.0</AltitudeMeters>
          <DistanceMeters>2.119999885559082</DistanceMeters>
          <HeartRateBpm>
            <Value>78</Value>
          </HeartRateBpm>
          <Cadence>27</Cadence>
          <Extensions>
            <ns3:TPX>
              <ns3:Speed>1.8029999732971191</ns3:Speed>
              <ns3:Watts>107</ns3:Watts>
            </ns3:TPX>
          </Extensions>
        </Trackpoint>
      </Track>
    </Lap>
  </Activity>
</Activities>
I researched various xml parsing libraries and settled on BeautifulSoup. I know it well from my various scraping projects in the past. I used the lxml-xml parser for my purposes.
Permalink to the full tcx parsing function
soup = BeautifulSoup(tcx_file, features="lxml-xml")
track_points = soup.find_all('Trackpoint')
if len(track_points) == 0:
    return False
The fields I care about are time, watts, and heart rate. My approach was to make sure there was a row per Trackpoint (aka a second of activity time), regardless of whether heart rate or watts were available at that Trackpoint. In cases where either was not available, I set the value to 0. I created a list of dicts, each representing one second, and converted this, in a roundabout way, to a single dict. This could use some refactoring to be much more performant. Ultimately, I end up with a pandas dataframe, converted to a parquet file, that looks like this:
| activity_id | start_time | data |
|---|---|---|
| 13947598259 | 1707845190 | "{'time': [1707845190, 1707845191, 1707845192...], 'watts': [425, 399, 384...], 'heart_rate': [141, 141, 142...]}" |
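The full parsing function is linked above; as a condensed sketch of the same idea (simplified to build the dict of lists directly instead of going through the intermediate list of dicts, and with function and variable names of my choosing rather than the real ones), the conversion looks roughly like this:

import pandas as pd
from bs4 import BeautifulSoup


def convert_tcx(activity_id, tcx_file, output_path):
    soup = BeautifulSoup(tcx_file, features="lxml-xml")
    track_points = soup.find_all("Trackpoint")
    if len(track_points) == 0:
        return False
    data = {"time": [], "watts": [], "heart_rate": []}
    for tp in track_points:
        # One row per Trackpoint; missing watts/heart rate default to 0
        # Extension fields like ns3:Watts are matched by local name with lxml-xml
        timestamp = int(pd.Timestamp(tp.find("Time").text).timestamp())
        watts = tp.find("Watts")
        heart_rate = tp.find("HeartRateBpm")
        data["time"].append(timestamp)
        data["watts"].append(int(watts.text) if watts is not None else 0)
        data["heart_rate"].append(
            int(heart_rate.find("Value").text) if heart_rate is not None else 0
        )
    # One row per activity, with the per-second series packed into a single column
    df = pd.DataFrame(
        {
            "activity_id": [activity_id],
            "start_time": [data["time"][0]],
            "data": [str(data)],
        }
    )
    df.to_parquet(output_path)  # requires pyarrow or fastparquet
    return output_path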
You might wonder what I was thinking and why I "didn't just flatten" the time, watts, and heart_rate metrics, since I had complete control over the creation of the parquet files. The answer is that I wanted the subsequent transformation steps in dbt to be harder than they would have been otherwise. It's too easy to construct end-to-end pipelines where each step is optimized for the final outcome, and that's just not realistic. I wanted there to be problems remaining to be solved in later stages.
The parquet files (one per cycling activity) are temporarily stored locally, then uploaded to Google Cloud Storage and deleted from local storage.
Writing a Dockerfile for a python script is fairly trivial. The challenging bit was giving the user created in the Dockerfile the correct permissions to create and manage files. I ended up adding this to my Dockerfile:
ADD --chown=dockeruser --chmod=700 ./activity_files ./activity_files
The chmod value 700 breaks down as: 7 = read, write, execute for owner; 0 = no permissions for group; 0 = no permissions for others.
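For context, a rough sketch of how that line fits into the rest of the Dockerfile (the base image, working directory, user name, and COPY layout are assumptions; only the ADD line above is from the actual Dockerfile):

FROM python:3.11-slim
# Create the non-root user that will run the script
RUN useradd --create-home dockeruser
WORKDIR /home/dockeruser
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
COPY --chown=dockeruser . .
# The user gets full permissions on the directory where raw and converted files are written
ADD --chown=dockeruser --chmod=700 ./activity_files ./activity_files
USER dockeruser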
The entry point for running the script is as simple as one can be:
CMD python main.py
Finally, I use this command to create and run the container:
docker compose up --build
I chose to use Google Cloud Run to host my python script. I could have put it anywhere. I wanted to go through the flow of doing this on GCP as part of this project. I might decide to move all of this elsewhere just for the practice.
First, I needed to get the image into Artifact Registry (formerly Container Registry). To do that, I tag my image with the registry path and push to it.
docker tag antren-app-server europe-west9-docker.pkg.dev/antren/antren-app/app
docker push europe-west9-docker.pkg.dev/antren/antren-app/app
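Before the push works, Docker needs to be authenticated against the Artifact Registry host; this is a one-time setup step along the lines of:

gcloud auth configure-docker europe-west9-docker.pkg.dev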
Then I created a Google Cloud Run Job that would point to the latest version in the Artifact Registry:
gcloud run jobs create app \
--image europe-west9-docker.pkg.dev/antren/antren-app/app:latest \
--set-env-vars BUCKET_NAME=***,GARMIN_EMAIL=*** \
--set-secrets GARMIN_PASSWORD=***:latest
This creates a job called "app" which points to my newly pushed Docker image and has the necessary environment variables along with it. I had to create a secret in Secret Manager for a sensitive variable. To see what a particular job's config looks like, I can use gcloud run jobs describe JOBNAME:
gcloud run jobs describe app
✔ Job app in region europe-west9
Executed 49 times
Last executed 2024-02-13T22:25:29.554814Z with execution app-zldvk
Last updated on 2024-01-26T14:43:36.743203Z by ********@******.****
Image: europe-west9-docker.pkg.dev/antren/antren-app/app:latest
Tasks: 1
Memory: 512Mi
CPU: 1000m
Task Timeout: 10m
Max Retries: 3
Parallelism: No limit
Service account: *********@antren.iam.gserviceaccount.com
Env vars:
BUCKET_NAME ********
GARMIN_EMAIL ********
Secrets:
GARMIN_PASSWORD ********:latest
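With the job in place, a single execution can be triggered on demand with something like the following (the --region flag depends on gcloud config defaults):

gcloud run jobs execute app --region europe-west9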
I needed a place to store all of my parquet files (~2,500 bike rides), so I created a Google Cloud Storage bucket:
gcloud storage buckets create gs://activity_files \
--project=antren \
--location=europe-west9
Uploading from my python script goes something like this:
import os
from google.cloud import storage
from dotenv import load_dotenv

load_dotenv()

BUCKET_NAME = os.getenv("BUCKET_NAME")


def upload_to_cloud(source_file_name, destination_blob_name):
    # Upload a local parquet file to the Cloud Storage bucket as a blob
    storage_client = storage.Client()
    bucket = storage_client.bucket(BUCKET_NAME)
    blob = bucket.blob(destination_blob_name)
    blob.upload_from_filename(source_file_name)
To check what files are in my bucket, I use:
gcloud storage ls --recursive gs://activity_files
gs://activity_files/3242914954.parquet
gs://activity_files/3246658518.parquet
gs://activity_files/3246699386.parquet
...
Next, I will describe the process of loading this into BigQuery.