
Musings about JSON and jq

2021-03-06

Introduction

Oh, JSON, JSON, JSON…what would the world be without you? Let me show you what I have learned recently about this ubiquitous data-interchange format. It turns out you can manipulate complex JSON documents on the command line with ease. This came as somewhat of a surprise to me, as I had always thought that JSON was fundamentally incompatible with traditional command line tools: the Unix filters cat, grep, sed and friends, which operate on ad hoc structured text files in a line-oriented fashion.

As a demonstration, let’s write a script to flatten and unflatten JSON objects. That is, let’s figure out a flattening algorithm for mapping inputs like this

{
  "menu": {
    "id": "file",
    "value": "File",
    "popup": {
      "menuitem": [
        { "value": "New", "onclick": "CreateNewDoc()" },
        { "value": "Open", "onclick": "OpenDoc()" },
        { "value": "Close", "onclick": "CloseDoc()" }
      ]
    }
  }
}

to a representation that encodes nested structures like this

[
  {
    "Key": "menu/id",
    "Value": "file"
  },
  {
    "Key": "menu/value",
    "Value": "File"
  },
  {
    "Key": "menu/popup/menuitem/0/value",
    "Value": "New"
  },
  {
    "Key": "menu/popup/menuitem/0/onclick",
    "Value": "CreateNewDoc()"
  },
  {
    "Key": "menu/popup/menuitem/1/value",
    "Value": "Open"
  },
  {
    "Key": "menu/popup/menuitem/1/onclick",
    "Value": "OpenDoc()"
  },
  {
    "Key": "menu/popup/menuitem/2/value",
    "Value": "Close"
  },
  {
    "Key": "menu/popup/menuitem/2/onclick",
    "Value": "CloseDoc()"
  }
]

The inverse of the flatten algorithm is naturally unflatten. It takes a flattened JSON and reproduces the original nested structure.

JSON juggling of the above kind occurs quite frequently in the world of programming so this exercise should have some practical usefulness as well.

Normally I would bang out a NodeJS script that gets the job done. After all, JSON stands for JavaScript Object Notation. Then there is jq, which advertises itself as “a lightweight and flexible command-line JSON processor.” I was aware of its existence but never gave it much attention.

Turns out jq frigging rocks! It seems to be an extremely well-designed piece of software. It fulfills a precise need and feels instantly familiar. As an added bonus, it only weighs 1MB, making it ideal in the Docker game where minimalistic runtimes are key. But let’s not get ahead of ourselves. Let’s first see how jq fares with the programming challenge we posed for ourselves.

Flatten

The flatten algorithm is a one-liner in jq:


def json_flatten:
  [ paths(scalars) as $path | { Key: $path | join("/"), Value: getpath($path) } ]
;

Obviously, this solution did not just spring into my mind immediately. There were many intermediate solutions and false starts. But in a way that is precisely what makes filter-based languages like jq so elegant. You can just start from anywhere and try a filter that does something potentially useful. Stuck in a dead end? Just revert to the previous filter and take it from there. Optimization amounts to reducing the number of filter stages.

You just have to take the correct initial step and the algorithm will emerge on its own during the process. What is more, the code reads like a mathematical proof, which it ultimately is, following a top-down structure and leaving behind a trail of intermediate results, each with a well-defined input and output.

Anyway, I had a hunch that the flattening algorithm was going to be very simple; it had all the smells of it. Reading the jq manual, the functions paths(PATTERN) and getpath(PATHS) seemed like useful little nuggets, and it turns out they get the job done. You may RTFM, too. Their use above should give you a clue what they do.
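To make those two nuggets a bit more concrete, here is a small sketch. The helper names scalar_paths and first_onclick and the sample document are made up for illustration; only paths, scalars and getpath come from jq itself.

```shell
# Hypothetical helper: print the path to every scalar (leaf) value,
# one compact JSON array per line.
scalar_paths() {
  jq -c 'paths(scalars)'
}

# Hypothetical helper: look up the value stored at one fixed path.
first_onclick() {
  jq -c 'getpath(["menu", "items", 0, "onclick"])'
}

# Made-up sample input, not taken from the post.
doc='{"menu":{"id":"file","items":[{"onclick":"CreateNewDoc()"}]}}'

printf '%s\n' "$doc" | scalar_paths
# → ["menu","id"]
# → ["menu","items",0,"onclick"]

printf '%s\n' "$doc" | first_onclick
# → "CreateNewDoc()"
```

A path is just an array of object keys and array indices, which is why joining it with "/" yields the flattened key.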

Unflatten

The unflatten algorithm is similarly concise:

def json_unflatten:
  map({ Path: .Key | split("/") | map(tonumber? // .), Value: .Value })
  | reduce .[] as $item ({}; setpath($item.Path; $item.Value))
;

Here we can showcase how jq operates. The first map operates on array input [{ Key, Value }, { Key, Value }...]. We produce a path expression by inverting the join call with the split call. This maps menu/popup/menuitem/0/value to ["menu", "popup", "menuitem", "0", "value"]. However, there is now one glitch. For jq to reconstruct the original JSON, the index "0" should be converted to 0. This is what map(tonumber? // .) does.
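The key-to-path conversion can be tried in isolation. The sketch below wraps just that stage in a shell function; the name key_to_path is hypothetical, while the filter inside is the one from json_unflatten.

```shell
# Hypothetical helper: turn one flattened key (a JSON string on stdin)
# back into a jq path array.
key_to_path() {
  # split undoes the join; tonumber? // . turns "0" into 0 and
  # leaves segments that are not numbers untouched.
  jq -c 'split("/") | map(tonumber? // .)'
}

printf '%s\n' '"menu/popup/menuitem/0/value"' | key_to_path
# → ["menu","popup","menuitem",0,"value"]
```

The ? suppresses the error that tonumber raises for a segment like "menu", and // then falls back to the original string.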

The reduce stage then starts from an empty object {} and updates it step by step with the setpath function. As the flattening scheme produces a unique path expression for each property, every setpath call will recover one new property of the final output:

{ "menu": { "id": "file" } }
{ "menu": { "id": "file", "value": "File" } }
{ "menu": { "id": "file", "value": "File", "popup": { "menuitem": [{ "value": "New" }] } } }
...
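One way to watch these intermediate states is to swap reduce for foreach, which emits the accumulator after every step instead of only at the end. This is a sketch for exploration, not part of the original script; the function name unflatten_steps and the hand-written input are assumptions.

```shell
# Hypothetical helper: same accumulation as the reduce stage, but
# foreach prints every intermediate object, one per line.
unflatten_steps() {
  jq -c 'foreach .[] as $item ({}; setpath($item.Path; $item.Value))'
}

# Input is what the first map stage would produce for three entries.
printf '%s\n' '[
  {"Path": ["menu", "id"], "Value": "file"},
  {"Path": ["menu", "value"], "Value": "File"},
  {"Path": ["menu", "popup", "menuitem", 0, "value"], "Value": "New"}
]' | unflatten_steps
# → {"menu":{"id":"file"}}
# → {"menu":{"id":"file","value":"File"}}
# → {"menu":{"id":"file","value":"File","popup":{"menuitem":[{"value":"New"}]}}}
```

Note how the numeric 0 in the last path makes setpath create an array for menuitem rather than an object.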

Discussion

Everything sure looks a-OK, but there is one major pitfall in the solution. In mathematical terms, the json_flatten function is not an isomorphism between arbitrary JSON documents and flattened JSONs: it is not injective, so two different inputs can flatten to the same output. In plain speak, there will be inputs that json_unflatten cannot restore correctly.

To start with, consider the JSON

{
  "root": {
    "some/path": true
  }
}

From the point of view of json_flatten, it is equivalent to

{
  "root": {
    "some": {
      "path": true
    }
  }
}

This problem could be fixed by URL-encoding the forward slash in object keys. jq can encode with its @uri format, but as of version 1.6 there is no built-in for URL-decoding. An open GitHub issue about this limitation exists.

Then there is a far more serious issue. The JSON documents { "root": { "0": true } } and { "root": [true] } are also equivalent inputs, and there is nothing that can be done about this without changing the specification of the flattening scheme.
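Both collisions are easy to reproduce. The sketch below wraps the flatten filter from earlier in a shell function (the wrapper itself is an assumption, the filter is unchanged) and feeds it the two pairs of inputs.

```shell
# Shell wrapper around the json_flatten filter shown earlier.
json_flatten() {
  jq -c '[ paths(scalars) as $path | { Key: $path | join("/"), Value: getpath($path) } ]'
}

# A slash inside a key is indistinguishable from nesting.
printf '%s\n' '{"root":{"some/path":true}}' | json_flatten
printf '%s\n' '{"root":{"some":{"path":true}}}' | json_flatten
# both → [{"Key":"root/some/path","Value":true}]

# A numeric object key is indistinguishable from an array index.
printf '%s\n' '{"root":{"0":true}}' | json_flatten
printf '%s\n' '{"root":[true]}' | json_flatten
# both → [{"Key":"root/0","Value":true}]
```

Since each pair produces identical output, no unflatten implementation can tell which of the two inputs it came from.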

This is problematic as now there are inputs like this

{
  "root": {
    "1e12": "one malloc(gazillion) plz"
  }
}

Sane JavaScript engines fall back to representing sparse arrays as hashmaps, saving us from exhausting the heap. Do you dare to run jq 'json_flatten | json_unflatten' on that input?

What to do? Well, Pinky, the same thing we do every time. We validate our input. In mathematical speak, we redefine the domain of the mapping. After all, numerical and path-like object keys are corner cases, so it’s perfectly fine to just reject inputs containing them.
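A minimal sketch of such a guard could look like the following. The names json_flatten_checked and has_ambiguous_key and the error message are hypothetical, and treating "any key that tonumber accepts" as numeric is one possible reading of the rule, not necessarily the one used in the original script.

```shell
# Hypothetical wrapper: flatten only when no object key is ambiguous.
json_flatten_checked() {
  jq -c '
    # Hypothetical guard: true if any object key contains a slash
    # or can be parsed as a number.
    def has_ambiguous_key:
      any(.. | objects | keys_unsorted[];
          contains("/") or (try (tonumber | true) catch false));

    if has_ambiguous_key
    then error("refusing to flatten: numeric or path-like object key")
    else [ paths(scalars) as $path | { Key: $path | join("/"), Value: getpath($path) } ]
    end
  '
}

printf '%s\n' '{"root":{"path":true}}' | json_flatten_checked
# → [{"Key":"root/path","Value":true}]

# jq exits with a non-zero status when error is raised.
printf '%s\n' '{"root":{"1e12":"plz"}}' | json_flatten_checked || echo "rejected"
```

Rejecting up front keeps the flatten and unflatten filters themselves untouched, at the cost of refusing some perfectly legal JSON.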

The end result flatten.jq with the above considerations reads in its entirety

Notice the shebang line. You can make the script executable and invoke it as ./flatten.jq FILE or ./flatten.jq FILE --args unflatten.
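The script itself is not reproduced here, so what follows is only a stand-in that matches the invocation described, not the author's file. It assumes jq 1.6 or later (for $ARGS), assumes jq lives at /usr/bin/jq, and leaves out the input validation for brevity.

```shell
# Write a stand-in flatten.jq. The quoted heredoc keeps the shell from
# touching $path, $item and $ARGS.
cat > flatten.jq <<'EOF'
#!/usr/bin/jq -f
# Assumed interpreter path; adjust to wherever jq is installed.

def json_flatten:
  [ paths(scalars) as $path | { Key: $path | join("/"), Value: getpath($path) } ]
;

def json_unflatten:
  map({ Path: .Key | split("/") | map(tonumber? // .), Value: .Value })
  | reduce .[] as $item ({}; setpath($item.Path; $item.Value))
;

# Flatten by default; unflatten when the first positional argument says so.
if $ARGS.positional[0] == "unflatten" then json_unflatten else json_flatten end
EOF
chmod +x flatten.jq

# Round trip, calling jq explicitly so the shebang path does not matter.
printf '%s\n' '{"a":{"b":[1,2]}}' | jq -c -f flatten.jq
# → [{"Key":"a/b/0","Value":1},{"Key":"a/b/1","Value":2}]

printf '%s\n' '{"a":{"b":[1,2]}}' | jq -c -f flatten.jq \
  | jq -c -f flatten.jq --args unflatten
# → {"a":{"b":[1,2]}}
```

Everything after --args is handed to the program as a positional string, which is how a single file can serve both directions.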

Where could this be useful? As mentioned in the introduction, one place where shell scripting is still required is in writing glue code for apps running in Docker containers. A typical startup sequence of a Docker container consists of first fetching the application configuration from some data store as JSON and then starting the application. A strict Twelve-Factor App approach requires that the configuration be exported to the application environment as environment variables. However, in most cases it suffices, or is actually safer, to just save the configuration as a JSON file and point the application to read it.

In AWS, the configuration data store of choice is AWS Parameter Store. It stores data precisely using the flattened scheme, and to reproduce the application configuration, some form of the unflatten algorithm is required. This is actually what prompted me to study jq in the first place.

It’s not difficult to do JSON massaging in Python, Perl, NodeJS or any scripting language like that. But since the unflattening algorithm is actually trivial, it is questionable why one should create a Docker image with hundreds of megabytes of extra junk just for manipulating JSON, especially if the main application is written in some non-scripting language like Java.

As a side note, moving the configuration fetching logic into the main application is a no-no, as it unnecessarily pulls the AWS SDK dependency into the main application.

The deeper issue here is that it’s actually really difficult to realize the goal of hardening an application runtime to the bare minimum. The scripting languages, with their CVEs and the AWS SDK, creep in very easily.

To tell you my honest opinion, we should all just script in Perl even though it sucks (I’ve actually never tried it). It’s there in practically every Docker image like it or not. In an ideal world, we could just use curl and jq for the configuration fetching step. But in the AWS land, that would raise the question of how to authenticate the curl call without external dependencies. Well, I do not know yet. I suppose I have to return to that in another blog post. Ta-ta for now!