jsongrep.py – Python for extracting pieces of JSON objects
Lots of APIs these days return JSON objects. I love JSON, but reading a raw JSON dump can be awkward. You can print the whole thing using pprint in Python (or your favorite language), but what if you want to grep out and print only parts of the JSON? I was thinking tonight that it would be easy and useful to write a recursive script to do that. It took about half an hour to arrive at this solution:
#!/usr/bin/env python

import sys
import re
import json
from pprint import pprint


def jsongrep(d, patterns):
    try:
        pattern = patterns.pop(0)
    except IndexError:
        # No patterns left: print whatever remains.
        pprint(d)
    else:
        if isinstance(d, dict):
            keys = filter(pattern.match, d.keys())
        elif isinstance(d, list):
            # Match the pattern against the list indices, as strings.
            keys = map(int,
                       filter(pattern.match,
                              ['%d' % i for i in range(len(d))]))
        else:
            # Neither a dict nor a list: match against the value itself.
            if pattern.match(str(d)):
                pprint(d)
            return
        for item in (d[key] for key in keys):
            # Pass a copy so sibling branches see the same patterns.
            jsongrep(item, patterns[:])


if __name__ == '__main__':
    try:
        j = json.loads(sys.stdin.read())
    except ValueError, e:
        print >>sys.stderr, 'Could not load JSON object from stdin.'
        sys.exit(1)

    jsongrep(j, map(re.compile, sys.argv[1:]))
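The script above is Python 2. For anyone on Python 3, here is a rough sketch of the same recursion written as a generator, so the matches can be collected instead of printed. The name jsongrep_iter is my own for illustration and is not part of the script; the matching rules are meant to mirror the ones above.

```python
import re


def jsongrep_iter(node, patterns):
    """Yield every value reached by matching patterns level by level."""
    if not patterns:
        # Out of patterns: whatever is left is a result.
        yield node
        return
    pattern, rest = patterns[0], patterns[1:]
    if isinstance(node, dict):
        children = [node[k] for k in node if pattern.match(k)]
    elif isinstance(node, list):
        # List indices are matched as decimal strings.
        children = [node[i] for i in range(len(node))
                    if pattern.match(str(i))]
    else:
        # Leaf value: match against its string form and stop descending.
        if pattern.match(str(node)):
            yield node
        return
    for child in children:
        yield from jsongrep_iter(child, rest)


def grep(data, *args):
    """Compile the string arguments and return all matches as a list."""
    return list(jsongrep_iter(data, [re.compile(a) for a in args]))


data = {"hey": {"there": "you", "joe": ["blow", "xxx"]}}
print(grep(data, "hey", "there"))
print(grep(data, "hey", "joe", "1"))
```

Returning values rather than printing them leaves the output format (pprint, json.dumps, plain text) up to the caller.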
Usage is really simple. Let’s look at a couple of easy examples from the command line:
$ echo '{"hey" : "you"}' | jsongrep.py hey
u'you'
jsongrep.py has matched the “hey” key in the JSON object and printed its value. Let’s do the same thing with a 2-level JSON object:
$ echo '{"hey" : { "there" : "you"}}' | jsongrep.py hey
{u'there': u'you'}
Again, we see the entire object corresponding to the “hey” key. We can add another argument to drill down into the object:
$ echo '{"hey" : { "there" : "you"}}' | jsongrep.py hey there
u'you'
As you might hope, you can use a regular expression for an argument:
$ echo '{"hey" : { "there" : "you"}}' | jsongrep.py 'h.*' '.*'
u'you'
which in this case could have been given more concisely as
$ echo '{"hey" : { "there" : "you"}}' | jsongrep.py h .
u'you'
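The short forms work because the script uses the compiled pattern's match method, which only anchors at the start of the string: a pattern has to match a prefix of the key, not all of it. A small Python 3 sketch of that behaviour (the helper name matching and the sample keys are made up for illustration):

```python
import re

keys = ["hey", "there", "other"]


def matching(pattern, keys):
    """Return the keys that the pattern matches from their first character."""
    regex = re.compile(pattern)
    return [k for k in keys if regex.match(k)]


print(matching("h", keys))     # only keys starting with "h"
print(matching(".", keys))     # any non-empty key
print(matching("ere", keys))   # "ere" occurs in "there", but not at the start
print(matching("he$", keys))   # "$" forces a whole-key match, so "hey" fails
print(matching("hey$", keys))
```

So an argument like h selects every key beginning with “h”; add a trailing $ when you want an exact key.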
So you can drill down into nested dictionaries quite easily. When jsongrep.py runs out of patterns it just prints whatever’s left. A special case of this is if you give no patterns at all, you get the whole JSON object:
$ echo '{"hey" : { "there" : "you"}}' | jsongrep.py
{u'hey': {u'there': u'you'}}
The regex patterns you pass on the command line are being matched against the keys of JSON objects (Python dicts). If jsongrep.py runs into a list, it will instead match against the list indices like so:
$ echo '{"hey" : { "joe" : ["blow", "xxx" ]}}' | jsongrep.py hey joe 1
u'xxx'
You can see we’ve pulled out just the second list element (index 1; list indices start at 0) after matching “hey” and “joe”. So jsongrep.py regex args can be used to navigate your way through both JSON objects and lists.
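One thing to keep in mind with list indices: they are matched as strings, and only from the start, so on a longer list the pattern 1 also selects 10, 11 and so on. A Python 3 sketch of the index selection, assuming the same rule as the script (select_indices is a hypothetical helper, not something in jsongrep.py):

```python
import re


def select_indices(items, pattern):
    """Return the list indices whose decimal string the pattern matches."""
    regex = re.compile(pattern)
    return [i for i in range(len(items)) if regex.match(str(i))]


short = ["blow", "xxx"]
longer = list("abcdefghijkl")  # 12 elements, indices 0..11

print(select_indices(short, "1"))
print(select_indices(longer, "1"))    # prefix match: 1, 10 and 11
print(select_indices(longer, "1$"))   # anchored: index 1 only
print(select_indices(longer, "[0-2]$"))
```

Anchoring with $ is the easy way to ask for exactly one index.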
Now let’s do something more interesting.
Twitter’s API can give you JSON, and the JSON is pretty chunky. For example, if I get my first 100 followers with this command:
curl 'http://api.twitter.com/1/statuses/followers.json?screen_name=terrycojones'
there’s 164 KB of output (try it and see). What if I just want the Twitter user names of the people who follow me? Looking at the JSON, I can see it starts with:
[{"profile_background_color":"131516","description":null
Hmm… looks like it’s a list of dictionaries. Let’s print just the first dictionary in the list:
curl 'http://api.twitter.com/1/statuses/followers.json?screen_name=terrycojones' |
jsongrep.py 0
which starts out:
{u'contributors_enabled': False,
u'created_at': u'Wed Jul 19 00:29:58 +0000 2006',
u'description': None,
u'favourites_count': 0,
u'follow_request_sent': False,
u'followers_count': 178,
u'following': False,
u'friends_count': 67,
u'geo_enabled': False,
u'id': 2471,
u'id_str': u'2471',
u'lang': u'en',
u'listed_count': 3,
u'location': None,
u'name': u'Roy',
u'notifications': False,
u'profile_background_color': u'131516',
u'profile_background_image_url': u'http://s.twimg.com/a/1288470193/images/themes/theme14/bg.gif',
u'profile_background_tile': True,
u'profile_image_url': u'http://a3.twimg.com/profile_images/194788727/roy_with_phone_normal.jpg',
u'profile_link_color': u'009999',
u'profile_sidebar_border_color': u'eeeeee',
u'profile_sidebar_fill_color': u'efefef',
u'profile_text_color': u'333333',
u'profile_use_background_image': True,
u'protected': False,
u'screen_name': u'wasroykosuge',
and you can see a “screen_name” key in there which looks like what we want. Let’s see the first few:
$ curl 'http://api.twitter.com/1/statuses/followers.json?screen_name=terrycojones' |
jsongrep.py . screen_name | head
u'wasroykosuge'
u'Piiiu_piiiu'
u'350'
u'KardioFit'
u'jrecursive'
u'doodlesockingad'
u'revinprogress'
u'cloudjobs'
u'PointGcomics'
u'lucasbuchala'
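Spelled out in Python 3, the two arguments . and screen_name amount to “every list index, then every key starting with screen_name”. Here is a sketch on a trimmed-down sample; the followers list below keeps only a few fields from the output above and is not the real API response.

```python
import re

# Trimmed sample in the shape of the Twitter response: a list of dicts.
followers = [
    {"screen_name": "wasroykosuge", "name": "Roy", "followers_count": 178},
    {"screen_name": "Piiiu_piiiu"},
]

any_index = re.compile(".")         # first argument: matches every index
field = re.compile("screen_name")   # second argument: matches the key

names = [
    user[key]
    for index, user in enumerate(followers)
    if any_index.match(str(index))
    for key in user
    if field.match(key)
]
print(names)
```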
Finally, here’s an example using FluidDB’s new /values HTTP call. I’ll ask FluidDB for all objects matching the query “has unionsquareventures.com/portfolio”, and from those matching objects I’ll pull back the value of the FluidDB tag named fluiddb/about. The result is JSON that starts out like this:
{"results": { "id" : {"93989942-b519-49b4-87de-ac834e6a6161": {"fluiddb/about": {"value": "http://www.outside.in"}}
You can see there’s a 5-level deep nesting of JSON objects. I just want the “value” key on all matching objects. Easy:
curl 'http://fluiddb.fluidinfo.com/values?query=has%20unionsquareventures.com/portfolio&tag=fluiddb/about' |
jsongrep.py results . . fluiddb/about value | sort
u'http://amee.cc'
u'http://getglue.com'
u'http://stackoverflow.com'
u'http://tumblr.com'
u'http://www.10gen.com'
u'http://www.boxee.tv'
u'http://www.buglabs.net'
u'http://www.clickable.com'
u'http://www.cv.im'
u'http://www.disqus.com'
u'http://www.etsy.com'
u'http://www.flurry.com'
u'http://www.foursquare.com'
u'http://www.heyzap.com'
u'http://www.indeed.com'
u'http://www.meetup.com'
u'http://www.oddcast.com'
u'http://www.outside.in'
u'http://www.returnpath.net'
u'http://www.shapeways.com'
u'http://www.simulmedia.com'
u'http://www.targetspot.com'
u'http://www.tastylabs.com'
u'http://www.tracked.com'
u'http://www.twilio.com'
u'http://www.twitter.com'
u'http://www.workmarket.com'
u'http://www.zemanta.com'
u'http://zynga.com'
And there you have it, a sorted list of all Union Square Ventures portfolio companies, from the command line, as listed here.
jsongrep.py will also try to match on things that are not objects or lists if it runs into them, so we can refine this list a little. E.g.,
curl 'http://fluiddb.fluidinfo.com/values?query=has%20unionsquareventures.com/portfolio&tag=fluiddb/about' |
jsongrep.py results . . fluiddb/about value '.*ee' | sort
u'http://amee.cc'
u'http://www.boxee.tv'
u'http://www.indeed.com'
u'http://www.meetup.com'
being the USV companies with “ee” somewhere in their URL. Or, for some advanced regex fun, the USV companies whose URLs don’t end in “.com”:
curl 'http://fluiddb.fluidinfo.com/values?query=has%20unionsquareventures.com/portfolio&tag=fluiddb/about' |
jsongrep.py results . . fluiddb/about value '.*(?
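The “ee” filter above needs the leading .* because leaf values are matched the same way as keys: from the start of the string. A Python 3 sketch of that leaf test, using a handful of the URLs from the sorted list (the variable names are mine):

```python
import re

urls = [
    "http://amee.cc",
    "http://getglue.com",
    "http://www.boxee.tv",
    "http://www.indeed.com",
    "http://www.meetup.com",
    "http://zynga.com",
]


def leaf_matches(values, pattern):
    """Keep the values whose string form the pattern matches from the start."""
    regex = re.compile(pattern)
    return [v for v in values if regex.match(str(v))]


print(leaf_matches(urls, ".*ee"))  # "ee" anywhere in the URL
print(leaf_matches(urls, "ee"))    # nothing: no URL starts with "ee"
```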
November 25th, 2010 at 4:46 am
Nice! Thanks. I’ll use this.
November 25th, 2010 at 6:54 am
Great, but maybe you should get rid of the unicode stuff so that it can be piped into something else.
November 25th, 2010 at 7:45 pm
Elegant code!
November 25th, 2010 at 8:34 pm
Very nice! Also in the same vein: Jsawk: https://github.com/micha/jsawk and Json-command: https://github.com/zpoley/json-command
February 16th, 2011 at 6:47 am
I the single quotes ‘ are being ruined(’) by whatever syntax highlighter code ur using… so it is harder to copy paste code from this blog
February 16th, 2011 at 2:39 pm
Oops! Thanks Sudarshan – fixed.
February 16th, 2011 at 4:01 pm
Love it. Not so keen on the python syntax output. Why not just output unicode strings instead of wrapping them in u''?
February 16th, 2011 at 5:22 pm
ah, I think instead of “pprint(d)” you could put in “json.dumps(d)”
February 18th, 2011 at 1:10 pm
Those who found jsongrep interesting may find http://kmkeen.com/jshon/ interesting… Also check out http://news.ycombinator.com/item?id=2234767
September 1st, 2011 at 10:48 pm
cool blog – keep up the writing sytle….
This is a great post – best I’ve seen in a while….
August 28th, 2012 at 12:03 am
Very elegant. I’m using it as a pattern for a jsondiff tool.
August 28th, 2012 at 12:22 am
Hi Hobson – thanks for dropping by :-) Post a link to your diff code when you’re done. I did another JSON thing recently too, at http://blogs.fluidinfo.com/terry/2012/08/09/describejson-a-python-script-for-summarizing-json-structure/
September 10th, 2013 at 5:52 pm
Flipping awesome! I was going to have to do something silly like pull it into a csv ( I may still have to..)
Thanks Mate!
September 10th, 2013 at 7:54 pm
You’re welcome :-)
March 9th, 2015 at 3:50 pm
Hi Terry, thanks for this useful snippet, you should post up on github.
IMHO I think describejson.py and jsongrep.py could be usefully combined into something perhaps named jsonreporter.py.
I am thinking of a tool to explore structure/content of json, composing groups of grep/regex to be baselined, resulting in a tool used by others to view specific instances of JSON.
March 9th, 2015 at 4:21 pm
Hi Quentin. One of them is on GitHub, at https://github.com/terrycojones/describejson and I just created https://github.com/terrycojones/jsongrep for you. If you end up writing a JSON reporter along the lines you describe, I’d be happy to hear about it.
Thanks for commenting!