Introduction
Oh, JSON, JSON, JSON…what would the world be without you? Let me show you what I have learned recently about this ubiquitous data-interchange format. It turns out you can manipulate complex JSON strings on the command line with ease. This came to me as somewhat of a surprise as I always thought that JSON is fundamentally incompatible with traditional command line tools, the Unix filters cat, grep, sed and friends, that operate on ad hoc structured text files in a line-oriented fashion.
As a demonstration, let’s write a script to flatten and unflatten JSON objects. That is, let’s figure out a flattening algorithm for mapping inputs like this
{
  "menu": {
    "id": "file",
    "value": "File",
    "popup": {
      "menuitem": [
        { "value": "New", "onclick": "CreateNewDoc()" },
        { "value": "Open", "onclick": "OpenDoc()" },
        { "value": "Close", "onclick": "CloseDoc()" }
      ]
    }
  }
}
to a representation that encodes nested structures like this
[
  {
    "Key": "menu/id",
    "Value": "file"
  },
  {
    "Key": "menu/value",
    "Value": "File"
  },
  {
    "Key": "menu/popup/menuitem/0/value",
    "Value": "New"
  },
  {
    "Key": "menu/popup/menuitem/0/onclick",
    "Value": "CreateNewDoc()"
  },
  {
    "Key": "menu/popup/menuitem/1/value",
    "Value": "Open"
  },
  {
    "Key": "menu/popup/menuitem/1/onclick",
    "Value": "OpenDoc()"
  },
  {
    "Key": "menu/popup/menuitem/2/value",
    "Value": "Close"
  },
  {
    "Key": "menu/popup/menuitem/2/onclick",
    "Value": "CloseDoc()"
  }
]
The inverse of the flatten algorithm is naturally unflatten. It takes a flattened JSON and reproduces the original nested structure.
JSON juggling of the above kind occurs quite frequently in the world of programming so this exercise should have some practical usefulness as well.
Normally I would bang out a NodeJS script that gets the job done. After all, JSON stands for JavaScript Object Notation. Then there is jq, which advertises itself as “a lightweight and flexible command-line JSON processor.” I was aware of its existence but never gave it much attention.
Turns out jq frigging rocks! It seems to be an extremely well-designed piece of software. It fulfills a precise need and feels instantaneously familiar. As an added bonus, it only weighs 1MB, making it ideal in the Docker game where minimalistic runtimes are key. But let’s not get ahead of ourselves, let’s first see how jq fares with the programming challenge we posed for ourselves.
Flatten
The flatten algorithm is a one-liner in jq:
def json_flatten:
  [ paths(scalars) as $path | { Key: $path | join("/"), Value: getpath($path) } ]
;
Obviously, this solution did not just spring into my mind immediately. There were many intermediate solutions and false starts. But in a way that is precisely what makes filter-based languages like jq so elegant. You can just start from anywhere and try a filter that does something potentially useful. Stuck in a dead end? Just revert to the previous filter and take it from there. Optimization amounts to reducing the number of filter stages.
You just have to take the correct initial step and the algorithm will emerge on its own during the process. What is more, the code reads like a mathematical proof, which it ultimately is, following a top-down structure and leaving behind a trail of intermediate results, each with a well-defined input and output.
Anyway, I had the hunch that the flattening algorithm was going to be very simple; it had all the smells of it. Reading the jq manual, the functions paths(PATTERN) and getpath(PATHS) seemed like useful little nuggets, and it turns out they get the job done. You may RTFM, too. Their use above should give you a clue what they do.
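If jq is still unfamiliar, it may help to see roughly the same idea spelled out in Python. This is only a sketch of what the filter above appears to do, not a port of jq internals: the helper name iter_scalar_paths is made up for illustration, and it stands in for the combination of paths(scalars) and getpath.

```python
def iter_scalar_paths(node, prefix=()):
    """Yield (path, value) for every scalar leaf of a decoded JSON document.

    Hypothetical helper: a rough stand-in for jq's paths(scalars) + getpath.
    """
    if isinstance(node, dict):
        for key, child in node.items():
            yield from iter_scalar_paths(child, prefix + (key,))
    elif isinstance(node, list):
        for index, child in enumerate(node):
            yield from iter_scalar_paths(child, prefix + (index,))
    else:
        # Strings, numbers, booleans and None (JSON null) are the leaves.
        yield prefix, node


def json_flatten(document):
    """Join each leaf path with "/" and pair it with the leaf value."""
    return [
        {"Key": "/".join(str(part) for part in path), "Value": value}
        for path, value in iter_scalar_paths(document)
        if path  # a bare scalar at the root has no path, so it is skipped
    ]


if __name__ == "__main__":
    sample = {"menu": {"id": "file", "popup": {"menuitem": [{"value": "New"}]}}}
    for entry in json_flatten(sample):
        print(entry)
```

Array indices simply become path segments like any object key, which is exactly what makes the scheme compact and, as discussed later, ambiguous.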
Unflatten
The unflatten algorithm is similarly concise:
def json_unflatten:
  map({ Path: .Key | split("/") | map(tonumber? // .), Value: .Value })
  | reduce .[] as $item ({}; setpath($item.Path; $item.Value))
;
Here we can showcase how jq operates. The first map operates on the array input [{ Key, Value }, { Key, Value }, ...]. We produce a path expression by inverting the join call with the split call. This maps menu/popup/menuitem/0/value to ["menu", "popup", "menuitem", "0", "value"]. However, there is one glitch. For jq to reconstruct the original JSON, the index "0" should be converted to 0. This is what map(tonumber? // .) does.
The reduce stage then starts to update an empty object {} in place with the setpath function. As the flattening scheme produces a unique path expression for each property, every setpath call will recover one new property of the final output:
{ "menu": { "id": "file" } }
{ "menu": { "id": "file", "value": "File" } }
{ "menu": { "id": "file", "value": "File", "popup": { "menuitem": [{ "value": "New" }] } } }
...
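The same two stages can be sketched in Python. The names to_segment and set_path are hypothetical helpers that only approximate tonumber? // . and setpath. One deliberate difference: this sketch treats only plain ASCII digit strings as array indices, whereas jq's tonumber accepts other number formats too.

```python
def to_segment(text):
    """Turn "0" into 0 and leave every other segment as a string.

    Assumption: only plain digit strings count as indices here.
    """
    return int(text) if text.isascii() and text.isdigit() else text


def set_path(node, path, value):
    """Store value at path, creating dicts and lists on the way (setpath-like)."""
    if not path:
        return value
    head, rest = path[0], path[1:]
    expected = list if isinstance(head, int) else dict
    if node is None:
        node = expected()
    if not isinstance(node, expected):
        # An index into an object (or a key into an array) is treated as an error.
        raise TypeError(f"cannot use {head!r} on a {type(node).__name__}")
    if expected is list:
        node.extend([None] * (head + 1 - len(node)))  # pad missing slots with null
        node[head] = set_path(node[head], rest, value)
    else:
        node[head] = set_path(node.get(head), rest, value)
    return node


def json_unflatten(entries):
    """Rebuild the nested document one Key/Value entry at a time."""
    result = {}
    for entry in entries:
        path = [to_segment(segment) for segment in entry["Key"].split("/")]
        result = set_path(result, path, entry["Value"])
    return result


if __name__ == "__main__":
    flat = [
        {"Key": "menu/id", "Value": "file"},
        {"Key": "menu/popup/menuitem/0/value", "Value": "New"},
    ]
    print(json_unflatten(flat))
```

Note the padding step in set_path: an index far beyond the current length forces the list to grow to that size, which is worth keeping in mind when reading the discussion of hostile inputs.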
Discussion
Everything sure looks a-OK, but there is one major pitfall in the solution. In mathematical terms, the json_flatten function is not an isomorphism between arbitrary JSON strings and flattened JSONs. In plain speak, there will be buggy inputs.
To start with, consider the JSON
{
"root": {
"some/path": true
}
}
From the point of view of json_flatten, it is equivalent to
{
"root": {
"some": {
"path": true
}
}
}
This problem could be fixed by URL-encoding the forward slash. However, as of version 1.6, there’s no support for URL-decoding. An open GitHub issue about this limitation exists.
Then there is a far more serious issue. The JSON { "root": { "0": true } } and { "root": [true] } are also equivalent inputs, and there is nothing that can be done about this without changing the specification of the flattening scheme.
This is problematic as now there are inputs like this:
{
"root": {
"1e12": "one malloc(gazillion) plz"
}
}
Sane JavaScript engines fall back to representing sparse arrays as hashmaps, saving us from a heap overflow. Do you dare to run jq 'json_flatten | json_unflatten' on that input?
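Both collisions described above are easy to reproduce. The following Python sketch uses a compact stand-in for the flatten filter (not the jq code itself) and checks that the two pairs of inputs end up with identical flattened forms.

```python
def json_flatten(node, prefix=()):
    """Compact Python stand-in for the jq filter: leaf paths joined with "/"."""
    if isinstance(node, dict):
        children = node.items()
    elif isinstance(node, list):
        children = enumerate(node)
    else:
        return [{"Key": "/".join(map(str, prefix)), "Value": node}]
    return [
        entry
        for key, child in children
        for entry in json_flatten(child, prefix + (key,))
    ]


# A key containing the separator versus a genuinely nested object.
slash_key = {"root": {"some/path": True}}
nested = {"root": {"some": {"path": True}}}

# A numeric-looking object key versus an array index.
numeric_key = {"root": {"0": True}}
array = {"root": [True]}

if __name__ == "__main__":
    print(json_flatten(slash_key) == json_flatten(nested))
    print(json_flatten(numeric_key) == json_flatten(array))
```

Since the flattened forms are identical, no unflatten implementation can tell which of the two originals it was given.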
What to do? Well, Pinky, the same thing we do every time. We validate our input. In mathematical speak, we redefine the domain of the mapping. After all, numerical and path-like object keys are corner cases, so it’s perfectly fine to just reject inputs containing them.
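One way such a validation step could look is sketched below in Python. The helper names are hypothetical, and using float() to decide what "looks numeric" is an assumption that errs on the side of rejecting too much; it is not how the jq script decides.

```python
def looks_numeric(key):
    """Assumption: anything Python's float() accepts is treated as numeric."""
    try:
        float(key)
    except ValueError:
        return False
    return True


def find_ambiguous_keys(node, prefix=()):
    """Return the paths of object keys that would not survive a round trip."""
    bad = []
    if isinstance(node, dict):
        for key, child in node.items():
            path = prefix + (key,)
            if "/" in key or looks_numeric(key):
                bad.append(path)
            bad.extend(find_ambiguous_keys(child, path))
    elif isinstance(node, list):
        # Array indices are fine; only object keys can be ambiguous.
        for index, child in enumerate(node):
            bad.extend(find_ambiguous_keys(child, prefix + (index,)))
    return bad


if __name__ == "__main__":
    print(find_ambiguous_keys({"root": {"some/path": True, "1e12": "plz"}}))
```

An empty result means the document lies inside the restricted domain, so flattening and unflattening it should be safe.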
The end result flatten.jq with the above considerations reads in its entirety
Notice the shebang line. You may make the above snippet executable and invoke it as ./flatten.jq FILE or ./flatten.jq FILE --args unflatten.
Where could this be useful? As mentioned in the introduction, one place where shell scripting is still required is in writing glue code for apps running in Docker containers. A typical startup sequence of a Docker container consists of first fetching the application configuration from some data store as JSON and then starting the application. A strict Twelve-Factor App approach requires that the configuration be exported to the application as environment variables. However, in most cases it suffices, or is actually safer, to just save the configuration as a JSON file and point the application to read it.
In AWS, the configuration data store of choice is AWS Parameter Store. It stores data precisely using the flattened scheme and to reproduce the application configuration some form of the unflatten algorithm is required. This is actually what prompted me to study jq in the first place.
It’s not difficult to do JSON massaging in Python, Perl, NodeJS or any scripting language like that. But since the unflattening algorithm is actually trivial, it is questionable why one should create a Docker image with hundreds of megabytes of extra junk just for manipulating JSON, especially if the main application is written in some non-scripting language like Java.
As a side note, moving the configuration fetching logic into the main application is a no-no, as it unnecessarily pulls the AWS SDK dependency into the main application.
The deeper issue here is that it’s actually really difficult to realize the goal of hardening an application runtime to the bare minimum. The scripting languages, with their CVEs and the AWS SDK, creep in very easily.
To tell you my honest opinion, we should all just script in Perl even though it sucks (I’ve actually never tried it). It’s there in practically every Docker image, like it or not. In an ideal world, we could just use curl and jq for the configuration fetching step. But in the AWS land, that would raise the question of how to authenticate the curl call without external dependencies.
Well, I do not know yet. I suppose I have to return to that in another blog post. Ta-ta for now!