Posted Thursday, August 9th, 2012 at 8:32 pm under python, tech.

describejson – a Python script for summarizing JSON structure

Yesterday I was sent a 24M JSON file and needed to look through it to give someone an opinion on its contents. I did what I normally do to look at JSON: I piped it into python -m json.tool. The result looked pretty good. I scrolled through some screens of a single long list, then jumped to the bottom to see what looked like the end of the list. What I didn’t know at the time was that the output was 495,647 lines long! And there was some important stuff in the middle of the output that I didn’t see at all.

So I decided to write a quick program to recursively summarize JSON. You can grab it from GitHub at https://github.com/terrycojones/describejson.

Usage is simple: just send it JSON on stdin. Here’s an example input:

{
  "cats": 3,
  "dogs": 6,
  "parrots": 1
}

Which gets you this output:

$ python describejson.py < input.json
1 dict of length 3. Values:
  3 ints

The output is a little cryptic, but you’ll get used to it (and I may improve it). In words, this is telling you that (after loading the JSON) the input contained 1 Python dictionary with 3 elements. The values of the 3 elements are all integers. The indentation is meaningful, of course. You can see that the script is summarizing the fact that the 3 values in the dict were all of the same type.

Here’s another sample input:

[
  ["fluffy", "kitty", "ginger"],
  ["fido", "spot", "rover"],
  ["squawk"]
]

Which gets you:

$ python describejson.py < input.json
1 list of length 3. Values:
  2 lists of length 3. Values:
    3 unicodes
  1 list of length 1. Values:
    1 unicode

In words, the input was a list of length 3. Its contents were 2 lists of length 3 that both contained 3 unicode strings, and a final list that contains just a single unicode string.
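
To make the idea concrete, here’s a minimal sketch of this kind of recursive summary. It is not the code from describejson.py: the names signature and describe are made up for illustration, the default length comparison is hard-wired, and it assumes that runs of adjacent similar values are what get collapsed. Under Python 3 the strings show up as str rather than unicode.

```python
import json
from itertools import groupby


def signature(value):
    """Return a key; adjacent values with equal keys are summarized together.

    Containers are compared by type and length, everything else by type
    alone (an assumption modelled on the default 'length' behaviour).
    """
    if isinstance(value, (list, dict)):
        return (type(value).__name__, len(value))
    return (type(value).__name__,)


def describe(values, indent=0):
    """Return summary lines for a sequence of already-parsed JSON values."""
    lines = []
    pad = "  " * indent
    # Collapse runs of adjacent values that share a signature.
    for key, run in groupby(values, key=signature):
        run = list(run)
        count = len(run)
        name = key[0] + ("s" if count != 1 else "")
        first = run[0]
        if not isinstance(first, (list, dict)):
            lines.append(f"{pad}{count} {name}")
            continue
        children = list(first.values()) if isinstance(first, dict) else first
        if children:
            lines.append(f"{pad}{count} {name} of length {len(first)}. Values:")
            # Only the first member of the run is described, as a representative.
            lines.extend(describe(children, indent + 1))
        else:
            lines.append(f"{pad}{count} {name} of length 0.")
    return lines


text = '[["fluffy", "kitty", "ginger"], ["fido", "spot", "rover"], ["squawk"]]'
print("\n".join(describe([json.loads(text)])))
```

Describing only the first member of each run keeps the output short, at the cost of hiding differences further down in the other members.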

Specifying equality strictness

The script currently takes just one option, --strictness (or just -s), to indicate how strict it should be when deciding whether things are “the same” for the purposes of summarizing them. The output above uses the default strictness, length, so the script considers the first two inner lists collapsible in the summary and prints a separate line for the last list, since it has a different length. Here’s the output from running with --strictness type:

$ python describejson.py --strictness type < input.json
1 list of length 3. Values:
  3 lists of length 3. Values:
    3 unicodes

The lists are all considered equal here. The output is a little misleading, since it tells us there are 3 lists of length 3, each containing 3 unicodes, when the last list actually has length 1. I may fix that.

We can also be more strict. Here’s the output from --strictness keys:

$ python describejson.py --strictness keys < input.json
1 list of length 3. Values:
  1 list of length 3. Values:
    3 unicodes
  1 list of length 3. Values:
    3 unicodes
  1 list of length 1. Values:
    1 unicode

The 3 inner lists are each printed separately because their contents differ. The keys argument is a bit confusing for lists: it just means the list values are compared. It’s clearer when you have dictionaries in the input.

This input:

[
  {
    "a": 1,
    "b": 2,
    "c": 3
  },
  {
    "d": 4,
    "e": 5,
    "f": 6
  }
]

produces

$ python describejson.py < input.json
1 list of length 2. Values:
  2 dicts of length 3. Values:
    3 ints

I.e., one list, containing 2 dictionaries, each containing 3 int values. Note that this is using the default of --strictness length so the two dicts are considered the same. If we run that input with strictness of keys, we’ll instead get this:

$ python describejson.py --strictness keys < input.json
1 list of length 2. Values:
  1 dict of length 3. Values:
    3 ints
  1 dict of length 3. Values:
    3 ints

The dicts are considered different because their keys differ. If we change the input to make the keys the same:

[
  {
    "a": 1,
    "b": 2,
    "c": 3
  },
  {
    "a": 4,
    "b": 5,
    "c": 6
  }
]

and run again with --strictness keys, the dicts are considered the same:

$ python describejson.py --strictness keys < input.json
1 list of length 2. Values:
  2 dicts of length 3. Values:
    3 ints

but if we use --strictness equal, the dicts will be considered different:

$ python describejson.py --strictness equal < input.json
1 list of length 2. Values:
  1 dict of length 3. Values:
    3 ints
  1 dict of length 3. Values:
    3 ints

Finally, making the dicts the same:

[
  {
    "a": 1,
    "b": 2,
    "c": 3
  },
  {
    "a": 1,
    "b": 2,
    "c": 3
  }
]

and running with --strictness equal will collapse the summary as you’d expect:

$ python describejson.py --strictness equal < input.json
1 list of length 2. Values:
  2 dicts of length 3. Values:
    3 ints

Hopefully it’s clear that less strict matching gives you more concise output, in which things are casually considered “the same”, while stricter matching gives you more verbose output, all the way up to strict equality for both lists and dicts.

Here’s the full set of --strictness options:

  • type: compare things by type only.
  • length: compare lists and dicts by length.
  • keys: compare lists by equality, dicts by keys.
  • equal: compare lists and dicts by equality.
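
As a rough illustration of how these four levels differ, here’s a small sketch of a comparison function. The name same and its signature are hypothetical and this is not how describejson.py is actually written; it also assumes that primitive values (ints, strings and so on) are compared by type alone at every level, which is what the outputs above suggest.

```python
def same(a, b, strictness="length"):
    """Decide whether two parsed JSON values count as 'the same'.

    A sketch of the four levels listed above, not the script's own code.
    """
    if type(a) is not type(b):
        return False
    if not isinstance(a, (list, dict)):
        # Assumption: primitives are only ever compared by type.
        return True
    if strictness == "type":
        return True
    if strictness == "length":
        return len(a) == len(b)
    if strictness == "keys":
        # Lists are compared by equality, dicts by their sets of keys.
        return a == b if isinstance(a, list) else a.keys() == b.keys()
    if strictness == "equal":
        return a == b
    raise ValueError("unknown strictness: %r" % (strictness,))


first = {"a": 1, "b": 2, "c": 3}
second = {"a": 4, "b": 5, "c": 6}
for level in ("type", "length", "keys", "equal"):
    print(level, same(first, second, level))
```

For the two dicts in the example, only the equal level reports a difference, since they share the same keys but not the same values.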

Improvements

The naming of the --strictness options could be improved. The keys option should probably be called values (but that is confusing, since dictionaries have values and it’s a comparison based on their keys!). A values option should probably also compare the value of primitive things, like integers and strings.

There are quite a few other things I might do to this script, if I ever have time. It would be helpful to print out some keys and values when these are short and unchanging. It would be good to show an example representative value of something that repeats (modulo strictness) many times. It might be good to be able to limit the depth to go into a JSON structure.

Overall though, I already find the script useful and I’m not in a rush to “improve” it by adding features. You can though :-)

You might also find it helpful to take what you learn about a JSON object via describejson and use that to grep out specific pieces of the structure using jsongrep.py.

If you’re curious, here’s the 24-line output summary of the 24M JSON I received. It’s much more concise than the nearly half a million lines from python -m json.tool:

1 dict of length 3. Values:
  1 int
  1 dict of length 4. Values:
    1 list of length 17993. Values:
      17993 dicts of length 5. Values:
        1 unicode
        1 int
        1 list of length 0.
        2 unicodes
    1 list of length 0.
    1 list of length 11907. Values:
      11907 dicts of length 5. Values:
        1 unicode
        1 int
        1 list of length 1. Values:
          1 unicode
        2 unicodes
    1 list of length 28068. Values:
      28068 dicts of length 5. Values:
        1 unicode
        1 int
        1 list of length 0.
        2 unicodes
  1 unicode