data engineering

Using Zig to speed up XML data extraction in Python

Mar 11, 2024

The outcome

I sped up my Python script that converts Garmin TCX files (XML format) into Parquet. The output below shows the difference in performance on my machine between v1 (BeautifulSoup and a Python for-loop) and v2 (my new tcx-extract package, which uses Zig), tested in a non-scientific manner. For the single activity in this run, the conversion went from 14.66s to 0.27s, roughly 54 times faster.

❯ python main.py
INFO:root:Authenticating API
INFO:root:Got 1 activities
TCX to Parquet v1 conversion took 14.66s
TCX to Parquet v2 conversion took 0.27s
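
For reference, a one-shot timing like the one above can be taken with a small helper. This is only a sketch: the name time_call and the stand-in workload are made up for illustration and are not the code from my script.

```python
import time
from typing import Callable

def time_call(label: str, fn: Callable[[], object]) -> float:
    """Run fn once, print how long it took, and return the elapsed seconds."""
    start = time.perf_counter()
    fn()
    elapsed = time.perf_counter() - start
    print(f"{label} took {elapsed:.2f}s")
    return elapsed

# Stand-in workload; in practice this would be the v1 or v2 conversion call.
time_call("stand-in workload", lambda: sum(range(100_000)))
```

A single run like this is noisy, which is why the numbers should be read as a rough comparison rather than a benchmark.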

See the commit where I made this update.

Why Zig and why not Rust

As part of my Master of Computer Science coursework, I've been experimenting with a lower-level language to play with the various sorting algorithms that I'm learning about. While C came to mind, I decided to try one of the newer languages, because why not.

I chose Zig because I got a good vibe from the community and hearing about the TigerBeetle database made me curious to learn more.

Why not Rust? This could have also been done in Rust. Or practically any language for that matter.

The Zig part

My Zig executable accepts two arguments: the path to a TCX file and the name of the target tag we want to get data for.

Here's what it does:

  • Allocates some memory and reads the file into it.
  • Splits the text on the tag <TrackPoint>.
  • Loops through each TrackPoint, splitting each one on the target tag.
  • Grabs the value of the target tag (an empty value if the tag is missing) and writes it to stdout.
  • Adds a line break after each value.
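
To make the steps above concrete, here is the same splitting logic sketched in Python. It is an illustration of the approach, not the package's code: the function name extract_values and the sample string are made up, and the spine tag is spelled the way it appears in this post.

```python
SPINE = "<TrackPoint>"  # hardcoded spine tag, spelled as in this post

def extract_values(text: str, target: str) -> list[str]:
    """Return one value per spine element; "" when the target tag is absent."""
    start, end = f"<{target}>", f"</{target}>"
    values = []
    # Skip the chunk before the first spine tag.
    for point in text.split(SPINE)[1:]:
        before_end = point.split(end)[0]   # text up to the closing target tag
        parts = before_end.split(start)    # [text before opening tag, value]
        values.append(parts[1] if len(parts) > 1 else "")
    return values

# Made-up sample: the second point has no Cadence tag.
sample = (
    "<Track>"
    "<TrackPoint><Time>t1</Time><Cadence>80</Cadence></TrackPoint>"
    "<TrackPoint><Time>t2</Time></TrackPoint>"
    "</Track>"
)
print(extract_values(sample, "Cadence"))  # ['80', '']
```

Note that this is plain string splitting, not XML parsing, so a tag written with attributes would not match.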

I'd like to say it was the most elegant thing ever created, but alas, it is not. It makes up for its looks with its results though:

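    // Skip the text before the first TrackPoint.
    _ = points.next();
    while (points.next()) |point| {
        // Keep only the text before the closing target tag (the whole point if the tag is absent).
        var tagBeforeAfters = std.mem.split(u8, point, targetTagEnd);
        while (tagBeforeAfters.next()) |tagBeforeAfter| {
            // Drop the text before the opening target tag; what follows is the value.
            var tagAfter = std.mem.split(u8, tagBeforeAfter, targetTagStart);
            _ = tagAfter.next();
            while (tagAfter.next()) |tag| {
                _ = try stdout.write(try std.fmt.allocPrint(allocator, "{s}", .{tag}));
                break;
            }
            // Always write a newline, so a missing tag becomes an empty line.
            _ = try stdout.write("\n");
            break;
        }
    }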

The Python part

Python acts as a controller. There's a function to build the Zig executable on whatever machine the package is being run on. The trickiest part was getting the executable into a place where the package could find it. I'm probably doing something wrong here, but I got it working by finding my way to the path of the package and putting the executable in a familiar directory.
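
A build step along these lines could look like the sketch below. It is not the package's actual build function: the names build_command and build, the source file name extract.zig, and the choice of ReleaseFast are all assumptions. Only the output location (a zig/extract file inside the package directory) comes from the extract.py file in this post.

```python
import os
import subprocess

def build_command(package_dir: str) -> list[str]:
    """Assemble a `zig build-exe` call that writes the binary to <package_dir>/zig/extract."""
    source = os.path.join(package_dir, "zig", "extract.zig")  # assumed source file name
    target = os.path.join(package_dir, "zig", "extract")
    # -O ReleaseFast is an assumed optimization mode; -femit-bin sets the output path.
    return ["zig", "build-exe", source, "-O", "ReleaseFast", f"-femit-bin={target}"]

def build(package_dir: str) -> None:
    """Compile the executable inside the package directory (needs zig on PATH)."""
    subprocess.run(build_command(package_dir), check=True)

# Show the command for a hypothetical install location without running it.
print(build_command("/path/to/package"))
```

Inside the package, build would be called with os.path.abspath(os.path.dirname(__file__)), the same directory that extract.py resolves at runtime.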

Here is the entirety of the Python extract.py file:

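import subprocess
import os

def get_tag(filepath: str, tag_name: str) -> str:
    """Run the Zig executable and return its raw stdout (one value per line)."""
    # The executable lives in the zig/ directory next to this file.
    cwd = os.path.abspath(os.path.dirname(__file__))
    abs_path = os.path.join(cwd, 'zig', 'extract')
    return subprocess.check_output([abs_path, filepath, tag_name]).decode('utf-8')

def extract(filepath: str, tag_name: str) -> list[str]:
    """Return one value per TrackPoint; a missing tag comes back as an empty string."""
    result = get_tag(filepath=filepath, tag_name=tag_name)
    if len(result) > 0:
        # Drop the empty string left after the final newline.
        return result.split('\n')[:-1]
    else:
        return []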
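
Because the executable writes one line per TrackPoint, even when the tag is missing, the lists returned for different tags line up by index and can be treated as columns. Here is a sketch of that idea. The helper to_columns and the stand-in fake_extract are hypothetical, and the tag names are only examples. With the real package, extract would be passed in place of fake_extract.

```python
from typing import Callable

def to_columns(
    extract_fn: Callable[[str, str], list[str]],
    filepath: str,
    tags: list[str],
) -> dict[str, list[str]]:
    """Call extract_fn once per tag and check that all columns have equal length."""
    columns = {tag: extract_fn(filepath, tag) for tag in tags}
    if len({len(values) for values in columns.values()}) > 1:
        raise ValueError("columns have different lengths")
    return columns

def fake_extract(filepath: str, tag_name: str) -> list[str]:
    """Stand-in for extract(); returns canned values instead of running Zig."""
    data = {"Time": ["t1", "t2"], "DistanceMeters": ["0.0", ""]}
    return data[tag_name]

print(to_columns(fake_extract, "activity.tcx", ["Time", "DistanceMeters"]))
# {'Time': ['t1', 't2'], 'DistanceMeters': ['0.0', '']}
```

A dictionary of equal-length lists like this is the shape most dataframe libraries accept before writing Parquet.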

Improvements

This was my first time publishing a package, so everything was a challenge. Here are some things I'd like to work on next to improve its usefulness and just for fun:

  • Replace the hardcoded <TrackPoint> and make it possible to use any repeating element as the spine for the extracted data points.
  • Make it possible to pass nested tags (e.g. HeartRate > BPM) instead of the child tag.
  • Make it possible to pass the parent tags (e.g. HeartRate) and still only return the innermost data points.
  • Improve my GitHub Actions for simple package release.
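
As a rough idea of where the first two items could go, here is a Python sketch that takes the spine element as a parameter and walks a nested tag path with the same split approach. Nothing here exists in the package yet: the function name extract_path and the sample string are hypothetical, and the tag names just follow the HeartRate > BPM example above.

```python
def extract_path(text: str, spine: str, path: list[str]) -> list[str]:
    """Return one value per spine element by descending through the tags in path."""
    values = []
    # Skip the chunk before the first spine tag.
    for chunk in text.split(f"<{spine}>")[1:]:
        for tag in path:
            # Narrow the chunk to the text between <tag> and </tag>, or "" if absent.
            parts = chunk.split(f"</{tag}>")[0].split(f"<{tag}>")
            chunk = parts[1] if len(parts) > 1 else ""
        values.append(chunk)
    return values

# Made-up sample: the second point has no HeartRate element.
sample = (
    "<TrackPoint><HeartRate><BPM>120</BPM></HeartRate></TrackPoint>"
    "<TrackPoint><Time>t2</Time></TrackPoint>"
)
print(extract_path(sample, "TrackPoint", ["HeartRate", "BPM"]))  # ['120', '']
```

Like the current version, this still relies on exact string matches, so tags with attributes would need extra handling.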
---
Last update: Mar 23, 2024