
jsongrep.py – Python for extracting pieces of JSON objects

02:10 November 25th, 2010 by terry. Posted under FluidDB, python. 18 Comments »

Lots of APIs these days return JSON objects. I love JSON, but reading a raw JSON dump can be awkward. You can print the whole thing using pprint in Python (or your favorite language), but what if you want to grep out and print only parts of the JSON? I was thinking tonight that it would be easy and useful to write a recursive script to do that. It took about half an hour to arrive at this solution:

#!/usr/bin/env python

import sys
import re
import json
from pprint import pprint

def jsongrep(d, patterns):
    try:
        pattern = patterns.pop(0)
    except IndexError:
        pprint(d)
    else:
        if isinstance(d, dict):
            keys = filter(pattern.match, d.keys())
        elif isinstance(d, list):
            keys = map(int,
                       filter(pattern.match,
                              ['%d' % i for i in range(len(d))]))
        else:
            if pattern.match(str(d)):
                pprint(d)
            return
        for item in (d[key] for key in keys):
            jsongrep(item, patterns[:])

if __name__ == '__main__':
    try:
        j = json.loads(sys.stdin.read())
    except ValueError, e:
        print >>sys.stderr, 'Could not load JSON object from stdin.'
        sys.exit(1)

    jsongrep(j, map(re.compile, sys.argv[1:]))

Usage is really simple. Let’s look at a couple of easy examples from the command line:

$ echo '{"hey" : "you"}' | jsongrep.py hey
u'you'

jsongrep.py has matched the “hey” key in the JSON object and printed its value. Let’s do the same thing with a 2-level JSON object:

$ echo '{"hey" : { "there" : "you"}}' | jsongrep.py hey 
{u'there': u'you'}

Again, we see the entire object corresponding to the “hey” key. We can add another argument to drill down into the object:

$ echo '{"hey" : { "there" : "you"}}' | jsongrep.py hey there
u'you'

As you might hope, you can use a regular expression for an argument:

$ echo '{"hey" : { "there" : "you"}}' | jsongrep.py 'h.*' '.*'
u'you'

which in this case could have been given more concisely as

$ echo '{"hey" : { "there" : "you"}}' | jsongrep.py h .
u'you'
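
The shorter patterns work because the script calls pattern.match, and Python's re match is anchored at the start of the string but not at the end, so a prefix is enough to select a key. Here is a minimal Python 3 sketch of that behaviour; matching_keys is a hypothetical helper for illustration, not part of jsongrep.py:

```python
import re


def matching_keys(pattern, keys):
    """Return the keys that a compiled pattern accepts via match()."""
    regex = re.compile(pattern)
    # match() succeeds if the pattern matches at the START of the key;
    # it does not require the whole key to be consumed.
    return [key for key in keys if regex.match(key)]


keys = ["hey", "there"]

print(matching_keys("h", keys))   # only keys beginning with "h"
print(matching_keys(".", keys))   # any key with at least one character
print(matching_keys("ey", keys))  # nothing: "ey" is not at the start of "hey"
```

To require an exact key, end the pattern with $ (for example 'hey$').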

So you can drill down into nested dictionaries quite easily. When jsongrep.py runs out of patterns it just prints whatever’s left. A special case of this is if you give no patterns at all, you get the whole JSON object:

$ echo '{"hey" : { "there" : "you"}}' | jsongrep.py
{u'hey': {u'there': u'you'}}

The regex patterns you pass on the command line are being matched against the keys of JSON objects (Python dicts). If jsongrep.py runs into a list, it will instead match against the list indices like so:

$ echo '{"hey" : { "joe" : ["blow", "xxx" ]}}' | jsongrep.py hey joe 1
u'xxx'

You can see we’ve pulled out just the second list element (index 1, since list indices start at 0) after matching “hey” and “joe”. So jsongrep.py regex args can be used to navigate your way through both JSON objects and lists.
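
The script above is written for Python 2. As a rough sketch of the same traversal in Python 3, the version below collects the selected values in a list instead of printing them. The name jsongrep_collect and the return-a-list design are assumptions made for this illustration, not the original script:

```python
import json
import re


def jsongrep_collect(data, patterns):
    """Return the values selected by a list of compiled regexes."""
    if not patterns:
        # Out of patterns: whatever is left is the result.
        return [data]
    pattern, rest = patterns[0], patterns[1:]
    if isinstance(data, dict):
        # Match the pattern against the object's keys.
        children = [v for k, v in data.items() if pattern.match(k)]
    elif isinstance(data, list):
        # Match the pattern against the list indices, as strings.
        children = [v for i, v in enumerate(data) if pattern.match(str(i))]
    else:
        # A leaf (string, number, ...): match against its string form.
        return [data] if pattern.match(str(data)) else []
    results = []
    for child in children:
        results.extend(jsongrep_collect(child, rest))
    return results


doc = json.loads('{"hey": {"joe": ["blow", "xxx"]}}')
patterns = [re.compile(p) for p in ["hey", "joe", "1"]]
print(jsongrep_collect(doc, patterns))  # ['xxx']
```

Because matching is prefix-based, the pattern 1 would also select indices 10 to 19 in a longer list; 1$ selects only index 1.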

Now let’s do something more interesting.

Twitter’s API can give you JSON, and the JSON is pretty chunky. For example, if I get my first 100 followers with this command:

curl 'http://api.twitter.com/1/statuses/followers.json?screen_name=terrycojones'

there’s 164Kb of output (try it and see). What if I just want the Twitter user names of the people who follow me? Looking at the JSON, I can see it starts with:

[{"profile_background_color":"131516","description":null

Hmm… looks like it’s a list of dictionaries. Let’s print just the first dictionary in the list:

curl 'http://api.twitter.com/1/statuses/followers.json?screen_name=terrycojones' |
jsongrep.py 0

which starts out:

{u'contributors_enabled': False,
 u'created_at': u'Wed Jul 19 00:29:58 +0000 2006',
 u'description': None,
 u'favourites_count': 0,
 u'follow_request_sent': False,
 u'followers_count': 178,
 u'following': False,
 u'friends_count': 67,
 u'geo_enabled': False,
 u'id': 2471,
 u'id_str': u'2471',
 u'lang': u'en',
 u'listed_count': 3,
 u'location': None,
 u'name': u'Roy',
 u'notifications': False,
 u'profile_background_color': u'131516',
 u'profile_background_image_url': u'http://s.twimg.com/a/1288470193/images/themes/theme14/bg.gif',
 u'profile_background_tile': True,
 u'profile_image_url': u'http://a3.twimg.com/profile_images/194788727/roy_with_phone_normal.jpg',
 u'profile_link_color': u'009999',
 u'profile_sidebar_border_color': u'eeeeee',
 u'profile_sidebar_fill_color': u'efefef',
 u'profile_text_color': u'333333',
 u'profile_use_background_image': True,
 u'protected': False,
 u'screen_name': u'wasroykosuge',

and you can see a “screen_name” key in there which looks like what we want. Let’s see the first few:

$ curl 'http://api.twitter.com/1/statuses/followers.json?screen_name=terrycojones' |
jsongrep.py . screen_name | head
u'wasroykosuge'
u'Piiiu_piiiu'
u'350'
u'KardioFit'
u'jrecursive'
u'doodlesockingad'
u'revinprogress'
u'cloudjobs'
u'PointGcomics'
u'lucasbuchala'

Finally, here’s an example using FluidDB’s new /values HTTP call. I’ll ask FluidDB for all objects matching the query “has unionsquareventures.com/portfolio” and from those matching objects I’ll pull back the value of the FluidDB tag named fluiddb/about. The result is JSON that starts out like this:

{"results": { "id" : {"93989942-b519-49b4-87de-ac834e6a6161": {"fluiddb/about": {"value": "http://www.outside.in"}}

You can see there’s a 5-level deep nesting of JSON objects. I just want the “value” key on all matching objects. Easy:

curl 'http://fluiddb.fluidinfo.com/values?query=has%20unionsquareventures.com/portfolio&tag=fluiddb/about' |
jsongrep.py results . . fluiddb/about value | sort
u'http://amee.cc'
u'http://getglue.com'
u'http://stackoverflow.com'
u'http://tumblr.com'
u'http://www.10gen.com'
u'http://www.boxee.tv'
u'http://www.buglabs.net'
u'http://www.clickable.com'
u'http://www.cv.im'
u'http://www.disqus.com'
u'http://www.etsy.com'
u'http://www.flurry.com'
u'http://www.foursquare.com'
u'http://www.heyzap.com'
u'http://www.indeed.com'
u'http://www.meetup.com'
u'http://www.oddcast.com'
u'http://www.outside.in'
u'http://www.returnpath.net'
u'http://www.shapeways.com'
u'http://www.simulmedia.com'
u'http://www.targetspot.com'
u'http://www.tastylabs.com'
u'http://www.tracked.com'
u'http://www.twilio.com'
u'http://www.twitter.com'
u'http://www.workmarket.com'
u'http://www.zemanta.com'
u'http://zynga.com'

And there you have it, a sorted list of all Union Square Ventures portfolio companies, from the command line, as listed here.
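
For comparison, here is roughly what the five patterns “results . . fluiddb/about value” amount to when written as ordinary Python 3 dictionary access. The sample mirrors the shape of the response quoted above; the first entry is taken from that snippet, while the second object id is a made-up placeholder and about_values is a hypothetical helper name:

```python
import json

# Sample shaped like the /values response quoted above. The second
# object id is a placeholder invented for this sketch.
sample = json.loads("""
{"results": {"id": {
    "93989942-b519-49b4-87de-ac834e6a6161":
        {"fluiddb/about": {"value": "http://www.outside.in"}},
    "hypothetical-object-id":
        {"fluiddb/about": {"value": "http://amee.cc"}}
}}}
""")


def about_values(response):
    """Return the sorted fluiddb/about values of every object in the response."""
    # "results" then "id" lead to a dict keyed by object id; the second
    # "." pattern on the command line walks every key of that dict.
    objects = response["results"]["id"]
    return sorted(obj["fluiddb/about"]["value"] for obj in objects.values())


print(about_values(sample))
```

The hand-written version has to know the exact path in advance; the regex arguments let you explore an unfamiliar response one level at a time.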

jsongrep.py will also try to match on things that are not objects or lists if it runs into them, so we can refine this list a little. E.g.,

curl 'http://fluiddb.fluidinfo.com/values?query=has%20unionsquareventures.com/portfolio&tag=fluiddb/about' |
jsongrep.py results . . fluiddb/about value '.*ee' | sort
u'http://amee.cc'
u'http://www.boxee.tv'
u'http://www.indeed.com'
u'http://www.meetup.com'

being the USV companies with “ee” somewhere in their URL. Or, for some advanced regex fun, the USV companies whose URLs don’t end in “.com”:

curl 'http://fluiddb.fluidinfo.com/values?query=has%20unionsquareventures.com/portfolio&tag=fluiddb/about' |
jsongrep.py results . . fluiddb/about value '.*(?


Facebook release Tornado and it’s not based on Twisted?

14:25 September 12th, 2009 by terry. Posted under FluidDB, python, tech, twisted. 17 Comments »

Image: Jay Smith


To their great credit, Facebook have just open-sourced more of their core software. This time it’s Tornado, an asynchronous web server written in Python.

Surely that can only mean one thing: Tornado is based on Twisted. Right?

Incredibly, no. Words fail me on this one. I’ve spent some hours today trying to put my thoughts into order so I could put together a reasonably coherent blog post on the subject. But I’ve failed. So here are some unstructured thoughts.

First of all, I’m not meaning to bash Facebook on this. At Fluidinfo we use their Thrift code. We’ll almost certainly use Scribe for logging at some point, and we’re seriously considering using Cassandra. Esteve Fernandez has put a ton of work into txAMQP to glue together Thrift, Twisted, and AMQP, and in the process became a Thrift committer.

Second, you can understand, or at least make an educated guess at, what happened: the Facebook programmers, like programmers everywhere, were strongly tempted to just write their own code instead of dealing with someone else’s. It’s not just about learning curves and fixing deficiencies; there are also issues of speed of changes and of control. At Fluidinfo we suffered through six months of maintaining our own set of related patches to Thrift before the Twisted support Esteve wrote was finally merged to trunk. That was painful and the temptation to scratch our own itch, fork, and forget about the official Thrift project was high.

Plus, Twisted suffers from the fact that the documentation is not as good as it could be. I commented at length on this over three years ago. Please read the follow-up posts in that thread for an illustration (one of many) of the maturity of the people running Twisted. Also note that the documentation has improved greatly since then. Nevertheless, Twisted is a huge project, it has tons of parts, and it’s difficult to document and to wrap your head around no matter what.

So you can understand why Facebook might have decided not to use Twisted. In their words:

We ended up writing our own web server and framework after looking at existing servers and tools like Twisted because none matched both our performance requirements and our ease-of-use requirements.

I’m betting it’s that last part that’s the key to the decision not to use Twisted.

But seriously…… WTF?

Twisted is an amazing piece of work, written by some truly brilliant coders, with huge experience doing exactly what Facebook set out to reinvent.

This is where I’m at a loss for words. I think: “what an historic missed opportunity” and “reinventing the wheel, badly” and “no, no, no, this cannot be” and “this is just so short-sighted” and “a crying shame” and many things besides.

Sure, Twisted is difficult to grok. But that’s no excuse. It’s a seriously cool and powerful framework, vastly more sophisticated, useful, and extensible than what Facebook have cobbled together. Facebook could have worked to improve twisted.web (which everyone agrees has some shortcomings) which could have benefitted greatly from even a small fraction of the resources Facebook must have put into Tornado. The end result would have been much better. Or Facebook could have just ignored twisted.web and built directly on top of the Twisted core. That would have been great too.

Or Facebook could have had a team of people who knew how to do it better, and produced something better than Twisted. I guess that’s the real frustration here – they’ve put a ton of effort into building something much more limited in scope and vision, and even the piece that they did build looks like a total hack built to scratch their short term needs.

What’s the biggest change in software engineering over the last decade? Arguably it’s the rise of test-driven development. I’m not the only one who thinks so. Yet here we are in late 2009 and Facebook have released a major piece of code with no test suite. Amazing. OK, you could argue this is a minor thing, that it’s not core to Tornado. That argument has some weight, but it’s hard to think that this thing is not a hack.

If you decide to use an asynchronous web framework, do you expect to have to manually set your sockets to be non-blocking? Do you feel like catching EWOULDBLOCK and EAGAIN yourself? Those sorts of things, and their non-portability (even within the world of Linux) are precisely the kinds of things that lead people away from writing C and towards doing rapid development using something robust and portable that looks after the details. They’re precisely the things Twisted takes care of for you, and which (at least in Twisted) work across platforms, including Windows.
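
To make the complaint concrete, here is a small Python 3 sketch of the bookkeeping involved when you manage non-blocking sockets by hand. It uses only the standard library; recv_nonblocking is a hypothetical helper written for this illustration and is not taken from Tornado or Twisted:

```python
import errno
import select
import socket


def recv_nonblocking(sock, bufsize=4096):
    """Return received bytes, or None if no data is ready yet."""
    try:
        return sock.recv(bufsize)
    except OSError as exc:
        # On a non-blocking socket, "no data yet" is reported as an error
        # whose code is EAGAIN or EWOULDBLOCK, depending on the platform.
        if exc.errno in (errno.EAGAIN, errno.EWOULDBLOCK):
            return None
        raise


a, b = socket.socketpair()
b.setblocking(False)  # the manual step the paragraph above refers to

first_read = recv_nonblocking(b)  # nothing has been sent yet
a.sendall(b"ping")
select.select([b], [], [], 1.0)   # wait (up to 1 s) until b is readable
second_read = recv_nonblocking(b)

print(first_read, second_read)
a.close()
b.close()
```

Every read and write path in a hand-rolled server needs this kind of check, which is the detail a framework is supposed to hide.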

It looks like Tornado are using a global reactor, which the Twisted folks have learned the hard way is not the best solution.

Those are just some of the complaints I’ve heard and seen in the Tornado code. I confess I’ve looked only superficially at their code – but more than enough to feel such a sense of lost opportunity. They built a small subsection of Twisted, they’ve done it with much less experience and elegance and hiding of detail than the Twisted code, and the thing doesn’t even come with a test suite. Who knows if it actually works, or when, or where, etc.?

And…. Twisted is so much more. HTTP is just one of many protocols Twisted speaks, including (from their home page): “TCP, UDP, SSL/TLS, multicast, Unix sockets, a large number of protocols including HTTP, NNTP, IMAP, SSH, IRC, FTP, and others”.

Want to build a sophisticated, powerful, and flexible asynchronous internet service in Python? Use Twisted.

A beautiful thing about Twisted is that it expands your mind. Its abstractions (particularly the clean separation and generality of transports, protocols, factories, services, and Deferreds; see here and here and here) make you a better programmer. As I wrote to some friends in April 2006: “Reading the Twisted docs makes me feel like my brain is growing new muscles.”

Twisted’s deferreds are extraordinarily elegant and powerful; I’ve blogged and posted to the Twisted mailing list about them on multiple occasions. Learning to think in the Twisted way has been a programming joy to me over the last 3 years, so you can perhaps imagine my dismay that a company with the resources of Facebook couldn’t be bothered to get behind it and had to go reinvent the wheel, and do it so poorly. What a pity.
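
For readers who have not met Deferreds, the core idea is a placeholder for a result that is not available yet, to which you attach a chain of callbacks. The toy class below is only a sketch of that idea in Python 3. It is not Twisted's implementation or API (the real Deferred uses addCallback and callback, and also handles errors through errbacks), and the names ToyDeferred, add_callback, and fire are invented for this illustration:

```python
import json


class ToyDeferred:
    """A toy stand-in for the callback-chain idea; not Twisted's Deferred."""

    def __init__(self):
        self._callbacks = []
        self._fired = False
        self.result = None

    def add_callback(self, func):
        """Queue func; if the result already arrived, run it immediately."""
        if self._fired:
            self.result = func(self.result)
        else:
            self._callbacks.append(func)
        return self  # returning self allows chained add_callback calls

    def fire(self, result):
        """Deliver the result and pass it through the queued callbacks in order."""
        self._fired = True
        self.result = result
        while self._callbacks:
            # Each callback receives the previous callback's return value.
            self.result = self._callbacks.pop(0)(self.result)


d = ToyDeferred()
d.add_callback(json.loads).add_callback(lambda doc: doc["hey"])

# Later, when the (pretend) network response arrives:
d.fire('{"hey": "you"}')
print(d.result)  # you
```

The code that requests the data and the code that processes it never need to know when, or in what order, the data actually arrives.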

In my case, I threw away an entire year of C code in order to use Twisted in FluidDB. That was a decision I didn’t take lightly. I’d written my own libraries to do lots of low level network communications and RPC – including auto-generating server and client side glue libraries to serialize and unserialize RPC calls and args (a bit like Thrift), plus a server and tons of other stuff. I chucked it because it was too brittle. It was too much of a hack. It wasn’t portable enough. It was too hard to get the details right. It wasn’t extensible.

In other words….. it was too much like Tornado! So I threw it all away in favor of Twisted. As I happily tell people, FluidDB is written in Python so it can use Twisted. It was a question of an amazingly great asynchronous networking framework determining the choice of programming language. And this was done in spite of the fact that I thought the Twisted docs sucked badly. The people behind Twisted were clearly brilliant and the community was great. There was an opportunity to make a bet on something and to contribute. I wish Facebook had made the same decision. It’s everyone’s loss that they did not. What a great pity.


FluidDB domain names available early (and free) for Twitter users

23:04 January 24th, 2009 by terry. Posted under FluidDB, Fluidinfo, twitter. 30 Comments »

Sometime in the next few months, Fluidinfo will launch an alpha version of FluidDB, the database with the heart of a wiki. It’s a big engineering task, and there will still be a lot to do when we go into alpha, so we’ll initially only have a small number of applications being built on FluidDB.

But that doesn’t mean you can’t get into the action early.

Starting today, we’re pleased to offer FluidDB domains for free to Twitter users. This is perhaps the simplest way you’ll ever sign up for a new web service – if you’re a Twitter user:

Simply follow FluidDB on Twitter.

Yes, that’s it. You’re done.

Later, when we create your FluidDB domain, we’ll send you your FluidDB password via a direct message in Twitter. Note that we haven’t asked for your real name, your email, a password, sent you a cookie, or asked you to fill out a pesky sign-up form. The point here is simply to give you an early opportunity to trivially claim your preferred name.

Feel free to tweet the URL of this posting (http://bit.ly/bezc). You can follow me too for extra credit. If you’re not already a Twitter user and you want a free FluidDB domain name, sign up for Twitter, and then follow FluidDB.

Mini FAQ:

Why would I do this? By following FluidDB you will reserve your (Twitter) user name as your domain name in FluidDB.

Is there any charge? No.

What is a FluidDB domain? Sorry, but you’ll have to wait to find out the answer to this. We can tell you though that FluidDB domains will have many uses, and that they won’t all be free.

What if I change my mind? Just unfollow FluidDB on Twitter.

Why Twitter? Because we like Twitter. We may do a similar thing for other services, allowing users to later claim their domain via OpenID, but that introduces the potential of naming conflicts.

Finally, please note that we can’t give an iron-clad guarantee that you’ll get your Twitter user name as your FluidDB domain name, but we’ll do our best. At this early stage of the game, we reserve the right to do whatever we want :-)
