
Creating a faster LookML parser, from scratch, in Zig

Sep 5, 2024

Post in progress

How it started

One day at work, while timing a CI script that needed to parse many LookML files, I noticed that things were taking a really long time. I narrowed the slowdown down to the step that parses the LookML files into Python dicts. The parser I was using is the only one I know of anyone using: lkml. It works just fine and there's no real reason to make another one.

But its slowness bothered me. And I had already created a parser of sorts: the tcx-extract Python package, written in Zig. What could possibly go wrong? LookML is just slightly less neatly structured than XML, right?

What is LookML

LookML is a file format that looks like something between YAML and JSON. Its rules are defined at the field level: each kind of field accepts its own set of parameters. For example, here is what a single field of a "view" can look like:

dimension: name {
    type: string
    label: "Ligma Names"
    sql: initcap(${TABLE}.id) ;;
    drill_fields: [email, phone]
    action: {
        # some other stuff happening here
    }
}

Parameters Docs

There are many possible parameters for each field and they all fall into a variety of value types:

  • non-quoted: yes | no | number | string...
  • quoted: "my label"
  • sql: full-on SQL, with the ability to use variables and Looker keywords
  • list
  • object
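
To make the five value types concrete, here is a minimal Python sketch of telling them apart. It is not the code from the repo (which is written in Zig); the function name classify_value and the rule "a key named sql or starting with sql_ holds SQL" are assumptions made for illustration.

```python
def classify_value(key: str, raw: str) -> str:
    """Guess a parameter's value type from its key and the shape of its value.

    Hypothetical helper for illustration; not part of lkml or the Zig parser.
    """
    value = raw.strip()
    # Assumption: SQL-bearing keys are "sql" or start with "sql_".
    if key == "sql" or key.startswith("sql_"):
        return "sql"
    if value.startswith('"'):
        return "quoted"
    if value.startswith("["):
        return "list"
    # An object opens a block, e.g. `action: {` or `dimension: name {`.
    if value.endswith("{"):
        return "object"
    return "non-quoted"


print(classify_value("label", '"Ligma Names"'))
print(classify_value("dimension", "name {"))
```

The key is checked before the value on purpose: SQL can start with almost any character, so the value alone cannot identify it.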

Example LookML View file

This variety of field types, combined with my lack of experience with parsers, not to mention my complete lack of understanding of how Zig memory allocation really works, resulted in a painful road to enlightenment.

All the ways I failed before I succeeded

At first, I tried splitting on spaces. That failed immediately.
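
A quick Python illustration of why splitting on spaces breaks down, using two lines from the example field above (the variable names are made up for this sketch):

```python
label_line = 'label: "Ligma Names"'
list_line = "drill_fields: [email, phone]"

# The quoted label is cut in half at the space inside the quotes.
label_tokens = label_line.split(" ")
print(label_tokens)  # ['label:', '"Ligma', 'Names"']

# The list is cut at the space after the comma, brackets and all.
list_tokens = list_line.split(" ")
print(list_tokens)  # ['drill_fields:', '[email,', 'phone]']
```

Spaces mean different things depending on whether you are inside quotes, a list, or SQL, so a context-free split cannot recover the values.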

Then I split on a variety of characters. That became hard to manage.

Then I started going character-by-character. This is where the "magic" started.

I created a "buffer" of sorts where I appended the latest chars, and then cleared it every time I found a key or a value. I detected what kind of value I was working with based on how the value started, unless the key was sql-related, in which case it was freestyle until reaching ";;". Every value type was assigned a terminating character (or sequence) that marks where the value ends.
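
The buffer-and-terminator idea can be sketched in Python roughly as follows. This is an illustration under assumptions, not the Zig implementation: the names parse_params and finish_value are hypothetical, only flat "key: value" parameters are handled (no nested objects, no comments), and SQL keys are assumed to be "sql" or to start with "sql_".

```python
def finish_value(buffer: list, terminator: str):
    """Turn the buffered characters into a value, based on its terminator."""
    value = "".join(buffer)
    if terminator == '"':
        return value  # keep quoted text exactly as written
    value = value.strip()
    if terminator == "]":
        return [item.strip() for item in value.split(",") if item.strip()]
    return value


def parse_params(text: str) -> dict:
    """Parse flat `key: value` parameters one character at a time."""
    params = {}
    buffer = []        # latest characters not yet turned into a key or value
    key = None         # None while a key is still being read
    terminator = None  # what ends the current value, once it is known
    i = 0
    while i < len(text):
        ch = text[i]
        if key is None:
            # Reading a key: everything up to the next colon.
            if ch == ":":
                key = "".join(buffer).strip()
                buffer.clear()
            else:
                buffer.append(ch)
            i += 1
            continue
        if terminator is None:
            # The start of the value decides how the value ends.
            if ch in " \t":
                i += 1
                continue
            if key == "sql" or key.startswith("sql_"):
                terminator = ";;"   # freestyle until ";;"
            elif ch == '"':
                terminator = '"'
                i += 1              # skip the opening quote
                continue
            elif ch == "[":
                terminator = "]"
                i += 1              # skip the opening bracket
                continue
            else:
                terminator = "\n"   # non-quoted values end with the line
        if text.startswith(terminator, i):
            params[key] = finish_value(buffer, terminator)
            i += len(terminator)
            key, terminator = None, None
            buffer.clear()
        else:
            buffer.append(ch)
            i += 1
    # A non-quoted value may end at end of input instead of a newline.
    if key is not None and terminator == "\n":
        params[key] = finish_value(buffer, terminator)
    return params


sample = '''
    type: string
    label: "Ligma Names"
    sql: initcap(${TABLE}.id) ;;
    drill_fields: [email, phone]
'''
parsed = parse_params(sample)
print(parsed)
```

Note that the SQL check looks at the key, not at the first character, which is what lets a colon or quote inside the SQL pass through untouched until ";;" shows up.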

Where I lost massive amounts of time was trying to get nested objects to print themselves out at the end. What do I mean by this? For example, I created a View struct where I saved pointers to Field structs in a HashMap. Then, at the end, I wanted each field to print itself out to produce a JSON output. But whenever I tried to access a child Field's parameters, I would get segfaults. I tried everything I could think of and nothing worked. I easily lost two weeks going down this fruitless path.

At this point, I could have stopped and started learning Zig. No time for that! I must waste time hitting my head against this wall instead!

Then, in a single all-nighter, I wrote the most disgusting parser ever created in this nascent language. What you can't fix with knowledge, experience and skills you can fix with if statements. Many, many if statements.

And guess what? It's fast AF. But it's completely untrustworthy at this point without testing on a lot more files. I'm going to throw a bunch of public lkml files at it and see how it does. If the results are within reason, I'll keep tweaking and actually turn it into a Python library, albeit with a giant warning on it.

What I learned

Read the docs, learn the theory, stop winging it.

But also: I will not give up until I solve the problem, even without all the tools.

Repo: alhankeser/lkml-parser

---
Last update: Sep 5, 2024