
Creating a faster LookML parser, from scratch, in Zig

Sep 5, 2024

How it started

One day, while timing a Python script that needed to parse many LookML files, I noticed that things were taking a really long time. I narrowed the slowdown down to the parsing of the LookML files (I needed them parsed into a dict). The parser I was using is the only one I know of anyone using: lkml. It works just fine and there's no real reason to make another one.

But its slowness bothered me. And I've already created a parser of sorts, the tcx-extract Python package, written in Zig. What could possibly go wrong? LookML is just slightly less neatly structured than XML, right?

What is LookML

LookML is a file format that looks like something between YAML and JSON. Its rules apply at the field level. For example, here is what a single field of a "view" can look like:

dimension: name {
    type: string
    label: "Customer Name"
    sql: initcap(${TABLE}.name) ;;
    drill_fields: [name, email, phone]
    action: {
        # some other stuff happening here
    }
}

Parameters Docs

There are many possible parameters for each field, and their values fall into a handful of types:

  • non-quoted: yes | no | number | string...
  • quoted: "my label"
  • sql: full-on sql with ability to use variables and looker keywords
  • list
  • object
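
As a rough illustration of the list above, here is a minimal Python sketch (not the Zig code from the repo) that guesses a value's type from its key and its first character. The function name `value_kind` and the rule "a key named `sql` or starting with `sql_` holds sql" are assumptions made for this example, not something the parser is confirmed to do.

```python
def value_kind(key, raw):
    """Guess the kind of a LookML value from its key and first character."""
    raw = raw.lstrip()
    # Assumption: sql-like parameters are named `sql` or start with `sql_`.
    if key == "sql" or key.startswith("sql_"):
        return "sql"
    if raw.startswith('"'):
        return "quoted"
    if raw.startswith("["):
        return "list"
    if raw.startswith("{"):
        return "object"
    # Everything else: yes | no | number | bare string...
    return "non-quoted"


for key, raw in [
    ("type", "string"),
    ("label", '"Customer Name"'),
    ("sql", "initcap(${TABLE}.name) ;;"),
    ("drill_fields", "[name, email, phone]"),
    ("action", "{"),
]:
    print(key, "->", value_kind(key, raw))
```

Note that a named object such as `dimension: name {` starts with a bare word, so a check on the first character alone is not enough to spot it; this sketch would label it "non-quoted".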

Example LookML View file

This variety of field types, combined with my lack of experience with parsers, not to mention my complete lack of understanding of how Zig memory allocation really works, resulted in a painful road to enlightenment.

All the ways I failed before I succeeded

At first, I tried splitting on spaces. That failed immediately.

Then I split on a variety of characters. That became hard to manage.

Then I started going character-by-character. This is where the "magic" started.

I created a "buffer" of sorts where I appended the latest chars, and then cleared it every time I found a key or a value. I detected what kind of value I was working with based on how the value started, unless the key was sql-related, in which case it was freestyle until reaching ";;". Every kind of value was assigned a terminator.
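
The buffer idea can be sketched in a few lines of Python. This is an illustration under assumptions, not the Zig implementation: it only handles flat `key: value` parameters (no nested objects), the function name `scan_pairs` is made up, and the terminators (`;;` for sql, a closing quote, `]` for lists, end of line for everything else) are one guess at what such a scheme could look like.

```python
def scan_pairs(text):
    """Scan flat `key: value` parameters one character at a time (no nesting)."""
    pairs = []
    buffer = []        # latest characters; cleared after each key or value
    key = None         # None while a key is still being read
    terminator = None  # chosen when the value starts
    i = 0
    while i < len(text):
        ch = text[i]
        step = 1
        if key is None:
            if ch == ":":
                key = "".join(buffer).strip()
                buffer.clear()
            else:
                buffer.append(ch)
        elif terminator is None:
            # Value has not started yet: skip whitespace, then pick a terminator.
            if not ch.isspace():
                if key == "sql" or key.startswith("sql_"):
                    terminator = ";;"   # sql is freestyle until ";;"
                elif ch == '"':
                    terminator = '"'
                elif ch == "[":
                    terminator = "]"
                else:
                    terminator = "\n"
                # Drop the opening quote of a quoted value, keep everything else.
                if ch != '"' or terminator == ";;":
                    buffer.append(ch)
        elif text.startswith(terminator, i):
            if terminator == "]":
                buffer.append(ch)       # keep the closing bracket of a list
            pairs.append((key, "".join(buffer).strip()))
            buffer.clear()
            step = len(terminator)
            key = terminator = None
        else:
            buffer.append(ch)
        i += step
    if key is not None and terminator is not None:
        # Input ended before the terminator showed up: keep what we have.
        pairs.append((key, "".join(buffer).strip()))
    return pairs


sample = (
    "type: string\n"
    'label: "Customer Name"\n'
    "sql: initcap(${TABLE}.name) ;;\n"
    "drill_fields: [name, email, phone]\n"
)
for pair in scan_pairs(sample):
    print(pair)
```

Even this flat version needs a branch per value type, which hints at how quickly the if statements pile up once objects can nest.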

Where I lost massive amounts of time was trying to get nested objects to be able to print themselves out at the end. What do I mean by this? For example, I created a View struct where I saved pointers to Field structs in a HashMap. Then, at the end, I wanted each field to print itself out to produce a JSON output. But when I tried to access a child Field's parameters, I would get segfaults. I tried everything I could think of and nothing worked. I easily lost two weeks going down this fruitless path.

At this point, I could have stopped and started learning Zig. No time for that! I must waste time hitting my head against this wall instead!

Then, in a single all-nighter, I wrote the most disgusting parser ever created in this nascent language. What you can't fix with knowledge, experience and skills you can fix with if statements. Many, many if statements.

And guess what? It's fast AF. But completely untrustworthy at this point without testing on a lot more files. I'm going to throw a bunch of public lkml files at it, see how it does and if it's within reason, I'll keep tweaking and actually turn it into a Python library, albeit with a giant warning on it.

What I learned

Read the docs: I need to really learn Zig before trying to do recursion.

Learn the academic approach to parsers/interpreters: unless I have a good re-invention of it, someone smarter has likely solved most of these problems already.

Repo: alhankeser/lkml-parser

Attempt #2

A day after I wrote the above, I did a complete rewrite of the parser, this time using a more traditional approach. I have a Reader, a Tokenizer and a Parser, nicely separated out, consuming a minimum amount of memory and working nicely. However, with my current level of understanding of how Zig works, I'm unable to properly nest nodes without running into a segfault, an unreachable code error, or an access of a union field while another field is active (even though that's clearly not the case). Essentially, any time I do anything recursive, things go terribly wrong. I've put in a lot of effort to troubleshoot, but have exhausted all avenues to finding a solution.
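
For reference, the tokenizer-plus-recursive-parser shape looks roughly like the Python sketch below. It is not a port of the Zig code: every name in it (`tokenize`, `Parser`, `parse`) is hypothetical, the output layout (named objects keyed by their name) is an arbitrary choice, and quoted strings are read without escape handling. Python's garbage collector also hides the memory-ownership questions that make the nesting step hard in Zig, so this shows only the control flow.

```python
import json


def tokenize(text):
    """Yield (kind, value) tokens; colons are dropped, sql runs until ';;'."""
    i, n, last_word = 0, len(text), ""
    while i < n:
        ch = text[i]
        if ch.isspace():
            i += 1
        elif ch == "#":                       # comment: skip to end of line
            while i < n and text[i] != "\n":
                i += 1
        elif ch in "{}[],":
            yield (ch, ch)
            i += 1
        elif ch == ":":
            i += 1
            if last_word == "sql" or last_word.startswith("sql_"):
                end = text.find(";;", i)
                if end == -1:
                    raise SyntaxError("sql value without closing ;;")
                yield ("sql", text[i:end].strip())
                i = end + 2
        elif ch == '"':
            end = text.index('"', i + 1)      # no escape handling in this sketch
            yield ("string", text[i + 1:end])
            i = end + 1
        else:
            start = i
            while i < n and not text[i].isspace() and text[i] not in '{}[],:"#':
                i += 1
            last_word = text[start:i]
            yield ("word", last_word)


class Parser:
    def __init__(self, text):
        self.tokens = list(tokenize(text))
        self.pos = 0

    def peek(self):
        return self.tokens[self.pos] if self.pos < len(self.tokens) else (None, None)

    def next(self):
        token = self.peek()
        self.pos += 1
        return token

    def parse_block(self, closing=None):
        """Parse key/value pairs until `closing` (None means end of input)."""
        result = {}
        while self.peek()[0] != closing:
            kind, key = self.next()
            if kind != "word":
                raise SyntaxError(f"expected a key, got {key!r}")
            value = self.parse_value()
            if isinstance(value, dict) and isinstance(result.get(key), dict):
                result[key].update(value)     # e.g. several `dimension` entries
            else:
                result[key] = value
        self.next()                           # consume the closing brace, if any
        return result

    def parse_value(self):
        kind, value = self.next()
        if kind == "{":
            return self.parse_block("}")      # the recursive step
        if kind == "[":
            return self.parse_list()
        if kind == "word" and self.peek()[0] == "{":
            self.next()
            return {value: self.parse_block("}")}  # named object
        if kind in ("word", "string", "sql"):
            return value
        raise SyntaxError(f"unexpected token {value!r}")

    def parse_list(self):
        items = []
        while self.peek()[0] != "]":
            kind, value = self.next()
            if kind in ("word", "string"):
                items.append(value)
            elif kind != ",":
                raise SyntaxError(f"unexpected {value!r} in list")
        self.next()                           # consume "]"
        return items


def parse(text):
    return Parser(text).parse_block()


SAMPLE = """
view: customers {
  dimension: name {
    type: string
    label: "Customer Name"
    sql: initcap(${TABLE}.name) ;;
    drill_fields: [name, email, phone]
    action: {
      # some other stuff happening here
    }
  }
}
"""
print(json.dumps(parse(SAMPLE), indent=2))
```

The nesting happens entirely in `parse_value` calling `parse_block`, which is exactly the recursive part that a Zig version has to back with explicitly allocated child nodes.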

Sadly, I'm going to put this project down for now until my level of understanding of Zig increases to where I can knowledgeably problem-solve. The approach of brute-forcing my way to a solution is not working.

---
Last update: Sep 21, 2024