Add to Technorati Favorites

Secure per-site passwords with no encrypted blob

16:45 February 3rd, 2013 by terry. Posted under python, tech. | 16 Comments »

Last night I read a Guardian article, Online passwords: keep it complicated. It’s a surprisingly good summary, given that it’s aimed at the general public. The author concludes by telling us he decided to adopt LastPass, and also mentions 1Password. The comments section on the page gives similar solutions, like KeePass. The security-conscious people I know have arrived at the same conclusion. There are plenty of articles on the web that summarize various similar products, e.g., Which Password Manager Is The Most Secure? at LifeHacker. A single encrypted blob offers good security and works well in practice. It also allows the storage of information like name, address, credit cards, etc., that can be used to auto-fill web forms.

But… I’ve never liked the idea of a single encrypted file with all my passwords in it. What if the storage is lost or corrupted? Could the file someday be decrypted by someone else? If my encrypted blob is online, what happens when I am offline? If the blob is to be stored locally, I need to think about where to put it, how to back it up, etc. If a company holds it for me, what happens if they go out of business or get hacked? What if they use proprietary encryption, a closed-source access app, or a proprietary underlying data format? Not all the above solutions have all these issues, but they all have some of them.

The crucial thing they all have in common is that they use a master password to encrypt all your passwords into a single blob, and the blob has to then be reliably stored and accessible forever.

An approach that requires no storage

I realized there’s a solution that doesn’t require any storage. It’s not perfect, but it has some attractive properties. I’ve already started using it for some sites.

[Edit: it has been pointed out in the comments that the following solution has been thought of before. See SuperGenPass.]

Here’s a simple first version of Python code to print my password for a given service:

import base64, getpass, hashlib, sys

service = sys.argv[1]
secret = getpass.getpass('Enter your master password: ')
password = base64.b64encode(hashlib.sha512(secret + service).digest())[:32]

print 'Your password on %s is %s' % (service, password)

The service name is given on the command line. The code prints a 32-character password for use on that service. Here’s some sample output:

        $ mypass.py facebook
        Enter your master password: 
        Your password on facebook is Wza2l5Tqy0omgWP+5DDsXjQLO/Mc07N8

        $ mypass.py twitter
        Enter your master password: 
        Your password on twitter is eVhhpjJrmtSa8XnNMu6vLSDhPeO5nFOT

This has some nice advantages. It also places some small requirements on the user. Unfortunately, however, it is not generally applicable – at least not today. These are discussed below.

Advantages

The obvious advantage is that there is no external storage. Your passwords are not stored anywhere. There’s no blob to store, protect, access, backup, worry about, etc. The algorithm used to generate your password is dead-simple, it’s open and available in dozens of languages, and it’s secure.

You’re free to use more than one master password if you like. You can invent your own services names (more on which below).

Requirements / User burden

As with all one-key-to-unlock-them-all approaches, the user obviously needs to remember their master password.

With this approach though, the user also has to remember the name they used for the service they’re trying to access. If you create a password for a service called “gmail” you’ll need to use that exact service name in the future. For me that’s not much of a burden, but I guess it would be for others.

There’s no reason why the list of services you have passwords for couldn’t be stored locally. If the password generator were in a browser extension, it could possibly suggest a service name, like “facebook.com”, based on the domain of the page you were on.

With this approach, it’s even more important that one’s master password be hard to guess. Unlike the single-encrypted-blob approach, anyone who can guess your master password (and the names you use for services) can immediately obtain your passwords. They don’t also need access to the blob – it doesn’t exist.

Additional security can be easily had by, for example, using a convention of adding a constant character to your service names. So, e.g., you could use “facebook*” and “twitter*” as service names, and not tell anyone how you form service names.

General applicability

Unfortunately, there is a major problem with this approach. That’s because different sites have different requirements on passwords. Some of the difficulties can be avoided quite easily, but there’s an additional problem, caused when services change their password policy.

The above code generates a Base64 password string. So, to give some examples, if the service you want a password for doesn’t allow a plus sign in your password, the above code might make something unacceptable to the service. Same thing if they insist that passwords must be at most 12 characters long.

Ironically, these services are insisting on policies that prevent the use of truly secure passwords. They’re usually in place to ensure that short passwords are chosen from a bigger space. It would be better, though more work, to impose restrictions only on short passwords.

In a perfect world, all sites could immediately switch to allowing Base64 passwords of length ≥ 16 (say). Then the above approach would work everywhere and we’d be done.

Varying password length

A general approach to adjusting the generated password is to take some of the Base64 information produced and use it to modify the password. For example, you might not comfortable with all your passwords being the same length, so we can compute a length like this:

import base64, getpass, hashlib, string, sys

b64letters = string.ascii_letters + '0123456789+/'
secret = getpass.getpass('Enter your master password: ')
password = base64.b64encode(hashlib.sha512(secret + service).digest())
lenAdjust = b64letters.find(password[-5]) % 16
print 'Your password on %s is %s' % (service, password[0:16 + lenAdjust])

This generates passwords that are between 16 and 31 characters in length:

        $ ./length-varying.py facebook
        Enter your master password: 
        Your password on facebook is 1nTlVGPhuWZf0l9Sk27
        $ ./length-varying.py twitter
        Enter your master password: 
        Your password on twitter is WE1DVZHAFBx2c3g63tR+Oi3Jxs4xMV

Satisfying site requirements

A possible approach to dealing with per-site password requirements is to have the code look up known services and adjust the initial password it generates to be acceptable. This can easily be done in a secure, random, repeatable way. For example:

  • If a site doesn’t allow upper case, lowercase the password.
  • If a site doesn’t allow digits, replace them with random letters.
  • If a site requires punctuation, you can replace some initial letters in the password with randomly chosen punctuation and then randomly permute the result using the Knuth Shuffle.

Some of these transformations use random numbers. These are easy to obtain: take an unused part of the Base64 string and use it to seed a RNG. For each transformation, you would need to call the RNG a fixed number of times, i.e., independent of the number of random numbers actually used to perform the transformation. That’s necessary in order to keep the RNG in a known state for subsequent transformations (if any).

For example, the following replaces digits 0-9 with a letter from A-J whose case is chosen randomly:

import base64, getpass, hashlib, random, string, sys

def getSeed(chars):
    seed = ord(chars[0])
    for letter in chars[1:]:
        seed = (seed << 8) & ord(letter)
    return seed

service = sys.argv[1]
b64letters = string.ascii_letters + '0123456789+/'
secret = getpass.getpass('Enter your master password: ')
digest = base64.b64encode(hashlib.sha512(secret + service).digest())
lenAdjust = b64letters.find(digest[-5]) % 16

passwordWithDigits = digest[0:16 + lenAdjust]
password = ''

random.seed(getSeed(digest[32:36]))
randoms = [random.randint(0, 1) for _ in passwordWithDigits]

for index, letter in enumerate(passwordWithDigits):
    if letter in '0123456789':
        base = ord('a' if randoms[index] else 'A')
        replacement = chr(base + ord(letter) - ord('0'))
        password += replacement
    else:
        password += letter

print 'Your password on %s is %s' % (service, password)

Here's the output for the above two services (using the same master password):

        $ ./no-digits.py facebook
        Enter your master password: 
        Your password on facebook is bnTlVGPhuWZfAljSkch
        $ ./no-digits.py twitter
        Enter your master password: 
        Your password on twitter is WEBDVZHAFBxccdgGdtR+OidJxsExMV

As you can see, the digits are replaced with letters (in a biased way, that we can ignore in an informal blog post). The RNG is in a known state because the number of times it has been called is independent of the number of digits in the pre-transformation text.

This approach can be used to transform the initial random sequence into one that satisfies a service's password restrictions.

It is difficult to reliably associate service names with site policies. To do so might require keeping a file mapping the name a user used for a service to the policy of the site. Although this doesn't defeat the purpose of this approach (since that file would not need to be stored securely), it is an additional and unwanted pain for the user. Part of the point was to try to entirely avoid additional storage, even if it doesn't have to be encrypted.

The major problem with per-site requirements

The major problem however is that sites may change their password policy. Even if our program knew the rules for all sites, it would have a real problem if a site changed its policy. The code would need to be updated to generate passwords according to the new site policy. Existing users, supposing they upgraded, would then be shown incorrect passwords and would need to do password resets, which is obviously inconvenient.

Conclusion

I like the above approach a lot, but don't see a way to solve the issue with changing site policies. I wouldn't mind building in some rules for known popular sites, but any step in that direction has its problems - at least as far as I can see.

For now, I'm going to start using the above approach on sites that allow a long random password with characters from the Base64 set. That covers the majority of sites I use. Importantly, that includes Google, so if I ever need password resets I can have them sent there, knowing that I can always log in to recover them.


CEL, a Chrome Event Logger

16:50 January 27th, 2013 by terry. Posted under browser extensions, tech. | Comments Off on CEL, a Chrome Event Logger

logo-128Last night I wrote CEL, a Chrome Event Logger, a Google Chrome extension that logs all known chrome.* API events to the Javascript console. Example use cases are:

  • You wonder if there is a Chrome API event that’s triggered for some action you take in the browser. Rather than guessing what the event might be and trying to find it in the API docs, you can enable CEL, perform the action in Chrome, and see what CEL logs.
  • You’re writing an extension and are unsure about whether an event is being triggered or with what arguments. Instead of adding an event listener in your own code and reloading your extension, you can just look in the CEL log.

Click here to install.

Viewing the logging

Once installed, you can examine the CEL logging by visiting chrome://extensions, clicking to enable Developer mode, and then clicking the link next to the CEL icon where it says Inspect views: _generated_background_page.html

Manually adjusting logging

In the JS console for the extension’s background page, there are several commands you can run to adjust what is logged:

// Return a list of the chrome.* API events being logged. 
CEL.enabled()

// Return a list of the chrome.* API events being ignored.
CEL.disabled()

// Enable logging of some calls (see below).
CEL.enable(name1, name2, ...)

// Disable logging of some calls (see below).
CEL.disable(name1, name2, ...)

The names you pass to CEL.enable and CEL.disable can be individual API calls (without the leading “chrome.”), or can be higher-level categories. Here are some examples:

// Enable chrome.tabs.onCreated and all chrome.webRequest.* events:
CEL.enable('tabs.onCreated', 'webRequest')

// Disable all chrome.tabs.* events and chrome.webNavigation.onCommitted
CEL.disable('tabs', 'webNavigation.onCommitted')

Note that CEL.enable will enable all necessary higher level logging. So, for example, if you call CEL.enable('omnibox.onInputEntered') all chrome.omnibox.* events (that have not been explicitly disabled) will be logged. If you don’t want to enable and disable groups of calls in this way, always pass explicit API calls.

CEL.disabled will show you the names of individual calls that are disabled, as well as any disabled higher levels.

Global enable / disable

The extension provides a context menu item that lets you globally enable or disable logging.

Installation from the Chrome web store

Go to http://bit.ly/chrome-event-logger which will redirect you to the CEL page in the CWS.

Tracking the development version

If you want the development version, you can install the extension by visting https://fluiddb.fluidinfo.com/about/chrome-event-logger/fluidinfo.com/chrome.crx. Chrome will warn you that extensions cannot be installed from non-Chrome Web Store URLs but will download the .crx file in any case. Open chrome://extensions and drag the .crx file you just downloaded onto that page.

Installation from source

The source to the extension is available on Github. Here’s how to clone and install it.

  • Download the repo: git clone http://github.com/terrycojones/chrome-event-logger
  • In Chrome, go to chrome://extensions
  • Click Developer mode
  • Click Load Unpacked Extension…
  • Navigate to the directory where you cloned the repo and click Open

A chrome extension for examining tab events and ids

17:31 December 19th, 2012 by terry. Posted under browser extensions, programming. | Comments Off on A chrome extension for examining tab events and ids

Yesterday I was on a call with a friend who told me that when he enters a URL into an existing Chrome tab, the tab id changes. He asked if I’d ever seen that happening, and I said no. I told him his code was probably to blame :-)

Anyway, I wrote a quick Chrome extension, called Tabsanity, to log all 8 tab events with the tab ids, as well as to run a simple sanity check on tab ids after every tab event.

All the action is in the Javascript console for the background page.

To see if you’ve got the issue my friend has, open a tab and go to http://en.wikipedia.org/wiki/Virtual_private_network. In the JS console you’ll see the tab id. Now go to the URL location bar, enter nytimes.com, and go to that URL. If Chrome is behaving properly for you, the tab id involved wont change. If you have the issue, the console log will show you that Chrome (quickly) removes the existing tab, creates a new one, and loads the nytimes page – resulting in a different tab id. We were both running Chrome 23.0.1271.101 on a MacBook Air. The same behavior happens in Incognito Mode with all other extensions disabled, and regular mode.

You can install from this link or get the source on Github.


Omit needless parens

15:41 December 18th, 2012 by terry. Posted under programming, python. | 9 Comments »

The famous 17th commandment in The Elements of Style is “Omit needless words”.

There should be an equivalent in programming, but for parentheses. Every time I see needless parens in a program I want to rip them out (unless they’re obviously there for formatting/readability reasons).

Community service message: Omit needless parens. When in doubt whether parens are needed, look up the precedence rules for the operators involved and only use parens if the default isn’t what you want.

Here’s why you shouldn’t use needless parens:

  • The #1 reason is that you’re making your code more difficult to read for people who know the language better than you do. A more experienced programmer will see a red flag and look at your code more carefully than necessary because they will be trying to figure out why you used the extra parens and if there’s something non-obvious going on. When I come across code like that, I usually conclude that whoever wrote the code doesn’t know the language that well. My opinion of the code goes down. My reading speed goes down too because the needless parens, in my estimation, indicate an increased likelihood that programmer has done other (worse) things elsewhere.
  • You don’t want to appear incompetent or lazy, or to slow down or put off people reading your code, right? (See above.)
  • Putting in needless parens is heading down a slippery slope. How many levels of extra parens should you stop at? The only clear cut rule that makes sense is to stop at zero.
  • If you pause to look up the precedence rules, you’ll make yourself a better programmer in the language in question. You’ll be able to read other people’s needless-parenthesis-free code with no problem. You can pen lofty holier-than-thou blog posts like this one.

Back in about 1985 I wrote a tiny shell script to print out operator precedence and associativity rules for C. When I started programming in Perl, I wrote one for it. Then one for Python and later one for Javascript. For your convenience, and as a reward for reading, I just stuck the 4 scripts up on Github.


Destructive, invasive, and dangerous behavior by UK ISP TalkTalk (aka StalkStalk)

00:45 December 5th, 2012 by terry. Posted under companies, tech. | 6 Comments »

Today I spent several hours trying to figure out what was going wrong with a web service I’ve been building. The service uses websockets to let browsers and the server send messages to each other over a connection that is held open.

I built and tested the service locally and it worked fine. But when I deployed it to a remote server, the websocket connections were mysteriously dropped 30-35 seconds after they were established. The error messages on the server were cryptic, as were those in the browser. Google came to the rescue, and led me to the eye-opening explanation

It turns out TalkTalk, my ISP, are also fetching the URLs my browser fetches, after a delay of 30-35 seconds! I guess they’re not doing it with all URLs, probably because they figure the sites are “safe” and not full of content that might be deemed objectionable. All the accesses come from a single IP address (62.24.252.133), and if you block that address or otherwise deny its connection attempts, the websocket problem goes away immediately.

Dear TalkTalk – this is a very bad idea. Here are 3 strong reasons why:

  1. First of all, you’re breaking the web for your own customers, as seen above. When your customers try to use a new start-up service based on websockets, their experience will be severely degraded, perhaps to the point where the service is unusable. Time and money will be spent trying to figure out what’s going on, and people will not be happy to learn that their ISP is to blame.
  2. Second, there’s a real privacy issue here. I don’t really want to go into it, but I don’t trust my ISP (any ISP) to securely look after data associated with my account, let alone all the web content I look at. I have Google web history disabled. I don’t want my ISP building up a profile of what the people in this house look at online. There’s a big difference between recording the URLs I go to and actually retrieving their content.
  3. Third, it’s downright dangerous. What if I were controlling a medical device via a web interface and TalkTalk were interfering by killing my connections or by replaying my requests? What if there was a security system or some other sensitive controller on the other end? There’s no way on earth TalkTalk should be making requests with unknown effects to an unknown service that they have not been authorized to use. The TalkTalk legal team should consider this an emergency. Something is going to break, perhaps with fatal consequences, and they are going to get sued.

If you want to read more opinions on this issue, try Googling TalkTalk 62.24.252.133. Lots of people have run into this problem and are upset for various reasons.

See also: Phorm.


Max Tabs

03:48 December 4th, 2012 by terry. Posted under browser extensions, tech. | 1 Comment »

Here’s the second of several Chrome and Chromium browser extensions I’ve recently written. Earlier, I posted some of the background motivation in Alternate browsing realities.

After installing Max Tabs, your browser will not allow you to have more than 15 tabs open at once. Any time you try to open more tabs, it will automatically close existing tabs to keep you at the limit. If you later close some tabs and go below the limit, tabs will be reopened to show URLs that were previously automatically closed. I.e., the URLs in tabs that are closed are not forgotten, they’re stored until you’re down to a reasonable number of tabs. (The URLs are not stored across browser sessions, though they easily could be.)

The main idea here is to limit the amount of memory Chrome consumes in keeping tabs open for you. I regularly have about 50 tabs open, sometimes for weeks or even months, on pages I’m planning to read. My laptop gets bogged down as Chrome consumes more and more memory. I’ve long wanted something to limit my tab habit. I didn’t really like any of the options I found in the Chrome Store, so I wrote my own. In case you’re wondering, the extension closes your rightmost open tabs.

Max Tabs installs a context menu item that shows you the number of URLs it has closed. If you click the context menu item, you can disable the extension, at which point it will immediately open tabs for all URLs it automatically closed.

Note that the extension starts out disabled. I set it up that way so it would be less alarming on installation (if you have over 15 tabs open when you install it, it will immediately close as many tabs as needed). Enable it via the context menu.

The extension is not in the Chrome Web Store yet. It’s still very easy to install: just click here to download the extension, then follow these instructions.

If you’re a programmer, or just curious about how to build Chrome extensions, the source code is available on Github. For info on what URLs the extension has closed tabs for, you can look in the console of its background page, accessible from chrome://extensions.


Open It Later

03:07 December 4th, 2012 by terry. Posted under browser extensions, tech. | Comments Off on Open It Later

Here’s the first of several Chrome and Chromium browser extensions I’ve recently written. Earlier, I posted some of the background motivation in Alternate browsing realities.

After installing Open It Later, your browser will randomly delay following links you click on. That is, instead of following the link in your existing tab, it immediately closes the tab! :-) If you open a new tab and try to go to a URL, that tab will immediately close too. The URL you were trying to reach will be opened in a new tab at a random future time, between 15 seconds and 5 minutes later.

This is pretty silly, of course. It deliberately goes directly against the idea that there should be an immediate (useful) reaction from your browser when you click a link. Think of it as something that slows you down, that makes your browsing more considered, that gives you a pause during which you might forget about something that you didn’t really need to read anyway.

Open It Later installs a context menu item that shows you the number of URLs that are pending opening. Click the context menu item to disable the extension. Not only will it ungrudgingly disable itself without pause, it will also immediately open all URLs that were scheduled to be opened in the future.

The extension is not in the Chrome Web Store yet. It’s still very easy to install: just click here to download the extension, then follow these instructions.

If you’re a programmer, or just curious about how to build Chrome extensions, the source code is available on Github. For info on when the extension plans to open your URLs, you can look in the console of its background page, accessible from chrome://extensions.

Special thanks to Hugh McLeod, who (unknowingly) provided Open It Later‘s Snake Oil icon:

Image: Hugh McLeod


Alternate browsing realities

01:28 December 4th, 2012 by terry. Posted under browser extensions, tech. | 2 Comments »

I find it interesting to look for things we take for granted, and to ask what the world would look like if the basic assumptions were outlandishly violated.

Recently I’ve been thinking about browsing. What do we all take for granted when browsing? A biggie is that when we click on a link (or open a new tab) the browser should take us to what we asked for, and do it immediately.

Below are some fanciful ways in which that could be different. You can probably think of others. Imagine if:

  • When you click a link, the new page is shown as usual, but only at some random point in the future. Clicking a link or opening a new tab on a URL, causes your current tab to immediately close!
  • Your browser restricts you to having (say) at most 10 tabs open. If you try to open more, the browser automatically picks another open tab and closes it. When you drop back under 10 tabs, a tab that was automatically closed pops into existence again.
  • When you click to follow a link or you open a new tab, the page appears in someone else’s browser, not in yours.
  • You and a group of friends are limited in the number of total tabs you can collectively have open. If you open a tab that takes you over the limit, a random tab is closed in the browser of a random group member. When the group drops under the limit, the tab is re-opened in the browser of a random group member.
  • You and a group of friends are limited so that only one of you can be looking at any given URL. I.e., if you go to a URL that one of your group already has open, their browser automatically closes their tab.
  • When you click on a link, your browser shows you the page and the page also appears in the browsers of a group of friends. If a friend then clicks on a link on that page, your tab follows along.
  • When reading an interesting page, with one click you can send the URL to a group of friends, whose browsers all load the page.

The nice thing about this kind of blue-sky thinking is that it starts out as frivolous or even ridiculous, but can quickly lead to interesting new approaches. For example, the idea of opening tabs in the future comes directly from questioning the immediacy of link following. Hot on the heels of the ridiculous idea of never following links at all, we land right next to the idea of a Read-Later button that millions of people already find very useful.

Anyway….. I decided to have some fun and implement several of the above for the Chrome and Chromium. I’ll write them up separately very soon.


A simple way to calculate the day of the week for any day of a given year

21:39 November 11th, 2012 by terry. Posted under me, other, python. | 2 Comments »

March 29th

Image: Jeremy Church

The other day I read a tweet about how someone was impressed that a friend had been able to tell them the day of the week given an arbitrary date.

There are a bunch of general methods to do this listed on the Wikipedia page for Determination of the day of the week. Typically, there are several steps involved, and you need to memorize some small tables of numbers.

I used to practice that mental calculation (and many others) when I was about 16. Although all the steps are basic arithmetic, it’s not easy to do the calculation in your head in a couple of seconds. Given that most of these questions that you’re likely to face in day-to-day life will be about the current year, it seemed like it might be a poor tradeoff to learn the complicated method to calculate the day of the week for any date if there was a simpler way to do it for a specific year.

The method I came up with after that observation is really simple. It just requires you to memorize a single 12-digit sequence for the current year. The 12 digits correspond to the first Monday for each month.

For example, the sequence for 2012 is 265-274-263-153. Suppose you’ve memorized the sequence and you need to know the day of the week for March 29th. You can trivially calculate that it’s a Thursday. You take the 3rd digit of the sequence (because March is the 3rd month), which is 5. That tells you that the 5th of March was a Monday. Then you just go backward or forward as many weeks and days as you need. The 5th was a Monday, so the 12th, 19th, and 26th were too, which means the 29th was a Thursday.

It’s nice because the amount you need to memorize is small, and you can memorize less digits if you only want to cover a shorter period. The calculation is very simple and always the same in every case, and you never have to think about leap years. At the start of each year you just memorize a single sequence, which is quickly reinforced once you use it a few times.

Here’s Python code to print the sequence for any year.

#!/usr/bin/env python

import datetime, sys

try:
    year = int(sys.argv[1])
except IndexError:
    year = datetime.datetime.today().year

firstDayToFirstMonday = ['1st', '7th', '6th', '5th', '4th', '3rd', '2nd']
months = ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun',
          'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec']
summary = ''

for month in range(12):
    firstOfMonth = datetime.datetime(year, month + 1, 1).weekday()
    firstMonday = firstDayToFirstMonday[firstOfMonth]
    print months[month], firstMonday
    summary += firstMonday[0]

print 'Summary:', '-'.join(summary[x:x+3] for x in range(0, 12, 3))

The output for 2012 looks like

Jan 2nd
Feb 6th
Mar 5th
Apr 2nd
May 7th
Jun 4th
Jul 2nd
Aug 6th
Sep 3rd
Oct 1st
Nov 5th
Dec 3rd
Summary: 265-274-263-153

The memory task is made simpler by the fact that there are only 14 different possible sequences. Or, if you consider just the last 10 digits of the sequences (i.e., starting from March), there are only 7 possible sequences. There are only 14 different sequences, so if you use this method in the long term, you’ll find the effort of remembering a sequence will pay off when it re-appears. E.g., 2013 and 2019 both have sequence 744-163-152-742. There are other nice things you can learn that can also make the memorization and switching between years easier (see the Corresponding months section on the above Wikipedia page).

Here are the sequences through 2032:

2012 265-274-263-153
2013 744-163-152-742
2014 633-752-741-631
2015 522-641-637-527
2016 417-426-415-375
2017 266-315-374-264
2018 155-274-263-153
2019 744-163-152-742
2020 632-641-637-527
2021 411-537-526-416
2022 377-426-415-375
2023 266-315-374-264
2024 154-163-152-742
2025 633-752-741-631
2026 522-641-637-527
2027 411-537-526-416
2028 376-315-374-264
2029 155-274-263-153
2030 744-163-152-742
2031 633-752-741-631
2032 521-537-526-416

Simpler Twisted deferred code via decorated callbacks

23:52 October 14th, 2012 by terry. Posted under deferreds, python, tech, twisted. | 10 Comments »

This morning I was thinking about Twisted deferreds and how people find them difficult to grasp, but how they’re conceptually simple once you get it. I guess most of us tell people a deferred is something to hold a result that hasn’t arrived yet. Sometimes, though, deferreds do have a result in them immediately (e.g., using succeed or fail to get an already-fired deferred).

I wondered if it might work to tell people to think of a deferred as really being the result. If that were literally true, then instead of writing:

d = getPage(...)
d.addErrback(errcheck, args)
d.addCallback(cleanup, args)
d.addCallback(reformat, args)
return d

We might write something like:

result1 = getPage(...)
result2 = errcheck(result1, args)
result3 = cleanup(result2, args)
return reformat(result3, args)

And if you could write that, you could obviously instead write:

return reformat(cleanup(errcheck(getPage(...), args), args), args)

If we could write Twisted code that way, I think using deferreds would be simpler for people unfamiliar with them. We could show them Twisted code and not even have to mention deferreds (see below).

In the style we’re all used to, the programmer manually adds callbacks and errbacks. That’s basically boilerplate. It gets worse when you then need to also use DeferredList, etc. It’s a little confusing to read deferred code at first, because you need to know that the deferred result/failure is automatically passed as the first arg to callbacks/errbacks. It seems to take a year or more for people to finally realize how the callback & errback chains actually interact :-) Also, I wonder how comfortable programmers are with code ordered innermost function first, as in the normal d.addCallback(inner).addCallback(outer) Twisted style, versus outer(inner()), as in the line above.

Anyway… I realized we CAN let people use the succinct style above, by putting boilerplate into decorators. I wrote two decorators, called (surprise!) callback and errback. You can do this:

@errback
def errcheck(failure, arg):
    ...

@callback
def cleanup(page, arg):
    ...

@callback
def reformat(page, arg):
    ...

reformat(cleanup(errcheck(getPage(...), arg1), arg2), arg3)

The deferred callback and errback chains are hooked up automatically. You still get a regular deferred back as the return value.

And… the “deferred” aspect of the code (or at least the need to talk about or explain it) has conveniently vanished.

You can also do things like

func1(getDeferred1(), errcheck(func2(getDeferred2(), getDeferred3())))

This gets the result of deferreds 2 & 3 and (if neither fails) passes the result of calling func2 on both results through to func1, which is called along with the result of deferred 1. You don’t need to use DeferredLists, as the decorator makes them for you. The errcheck function wont be called at all unless there’s an error.

That’s nice compared to the verbose equivalent:

d1 = DeferredList([getDeferred2(), getDeferred3()])
d1.addCallback(func2)
d1.addErrback(errcheck)
d2 = DeferredList([getDeferred1(), d1])
d2.addCallback(func1)

Or the more compact but awkward:

DeferredList([getDeferred(), DeferredList([getDeferred(), getDeferred()]).addCallback(
    func2).addErrback(errcheck)]).addCallback(func1)

There’s lots more that could be said about this, but that’s enough for now. The code (surely not bulletproof) and some tests are on Github. I’ll add a README sometime soon. This is still pretty much proof of concept, and some it could be done slightly differently. I’m happy to discuss in more detail if people are interested.


SOBGTR OCCC AILD FUNEX?

08:10 August 10th, 2012 by terry. Posted under me, other. | Comments Off on SOBGTR OCCC AILD FUNEX?

Suppose you had to pick a very small set of character strings that you, and only you, could identify without hesitation in a particular way. What would you choose? How small a set could you choose and still be unique? For example, SOBGTR OCCC AILD FUNEX? is a set of strings that I think would uniquely identify me. (My interpretation is below.) I’m pretty sure that almost any subset of 3 of them would suffice. Coming up with a set of two wouldn’t be hard, I don’t think – but it feels risky.

There are 7 billion people on the planet. So if you just pick 3 reasonably obscure acronyms, e.g., things that only 1 person in 2000 would recognize, you’re heading in the right direction (since 2000 cubed is 8 billion). But that’s only if the obscurity of the things you pick is independent. For example, it’s less good to pick 3 computer acronyms from the 1960s than to choose just one of them plus some things from very different areas of your knowledge.

The rules

  1. Each of your strings with its meaning to you must be findable on Google.
  2. To match with you, another person must interpret all your strings the way you do.

Rule 1 prevents you from choosing something like your bank PIN number, that only you could possibly know. Without this rule, everyone could trivially choose a set of one string. The rule makes thinking up a uniquely identifying set for yourself like a game. Given that all your strings and their interpretations are on Google, each of your strings will likely be recognized by someone in the way you recognize it, so your set will probably have at least 2 strings. You need to choose a set of strings whose set of interpretations, taken as a whole, make you unique (Rule 2).

Why is this interesting?

I find this interesting for many reasons. It seems clear that uniquely identifying sets are fairly easy to construct for people and they’re very small. Certainly small enough to fit in a tweet. Although it’s easy to make a set for yourself, it’s hard to make one for someone else – you might even argue that by definition it’s not possible. If someone else makes one, you can’t produce their set of interpretations without spending time on Google, and even then you’d probably have to know the person pretty well.

Is there a new authentication scheme here somewhere? It’s tempting to think yes, but there probably isn’t. This is less secure than asking people for a set of secrets that are not each findable in Google, so anything you come up with is almost certain to be less secure than the same thing based on a set of actual secrets. It’s more of a fun thought exercise (or Twitter game). It’s not hard to imagine some form of authentication. For example, identify which of a set of symbols are special to you (avoiding others chosen randomly from, say, the set of all acronyms), and their correct interpretations for you, and do it rapidly. Or if a clone shows up one day, claiming to be you, and you’ve thoughtfully put a sealed set of unique symbol strings in your safe, you should be able to convince people that you’re the real you :-)

Answer

Here’s my unhesitating interpretation of the set of 4 strings above:

Remember, to be me you have to get them all. It’s not enough to get a couple, or even three of them.


describejson – a Python script for summarizing JSON structure

20:32 August 9th, 2012 by terry. Posted under python, tech. | 3 Comments »

Yesterday I was sent a 24M JSON file and needed to look through it to give someone an opinion on its contents. I did what I normally do to look at JSON, piping it into python -m json.tool. The result looked pretty good, I scrolled through some screens with a single long list and jumped to the bottom to see what looked like the end of the list. What I didn’t know at the time was that the output was 495,647 lines long! And there was some important stuff in the middle of the output that I didn’t see at all.

So I decided to write a quick program to recursively summarize JSON. You can grab it from Github at https://github.com/terrycojones/describejson.

Usage is simple, just send it JSON on stdin. Here’s an example input:

{
  "cats": 3,
  "dogs": 6,
  "parrots": 1
}

Which gets you this output:

$ python describejson.py < input.json
1 dict of length 3. Values:
  3 ints

The output is a little cryptic, but you’ll get used to it (and I may improve it). In words, this is telling you that (after loading the JSON) the input contained 1 Python dictionary with 3 elements. The values of the 3 elements are all integers. The indentation is meaningful, of course. You can see that the script is summarizing the fact that the 3 values in the dict were all of the same type.

Here’s another sample input:

[
  ["fluffy", "kitty", "ginger"],
  ["fido", "spot", "rover"],
  ["squawk"]
]

Which gets you:

$ python describejson.py < input.json
1 list of length 3. Values:
  2 lists of length 3. Values:
    3 unicodes
  1 list of length 1. Values:
    1 unicode

In words, the input was a list of length 3. Its contents were 2 lists of length 3 that both contained 3 unicode strings, and a final list that contains just a single unicode string.

Specifying equality strictness

The script currently takes just one option, --strictness (or just -s) to indicate how strict it should be in deciding whether things are “the same” in order to summarize them. In the above output, the default strictness length is used, so the script considers the first two inner lists to be collapsible in the summary, and prints a separate line for the last list since it’s of a different length. Here’s the output from running with --strictness type:

$ python describejson.py --strictness type < input.json
1 list of length 3. Values:
  3 lists of length 3. Values:
    3 unicodes

The lists are all considered equal here. The output is a little misleading, since it tells us there are 3 lists of length 3, each containing 3 unicodes. I may fix that.

We can also be more strict. Here’s the output from --strictness keys:

$ python describejson.py --strictness keys < input.json
1 list of length 3. Values:
  1 list of length 3. Values:
    3 unicodes
  1 list of length 3. Values:
    3 unicodes
  1 list of length 1. Values:
    1 unicode

The 3 inner lists are each printed separately because their contents differ. The keys argument is also a bit confusing for lists, it just means the list values. It’s clearer when you have dictionaries in the input.

This input:

[
  {
    "a": 1,
    "b": 2,
    "c": 3
  },
  {
    "d": 4,
    "e": 5,
    "f": 6
  }
]

produces

$ python describejson.py < input.json
1 list of length 2. Values:
  2 dicts of length 3. Values:
    3 ints

I.e., one list, containing 2 dictionaries, each containing 3 int values. Note that this is using the default of --strictness length so the two dicts are considered the same. If we run that input with strictness of keys, we’ll instead get this:

$ python describejson.py --strictness keys < input.json
1 list of length 2. Values:
  1 dict of length 3. Values:
    3 ints
  1 dict of length 3. Values:
    3 ints

The dicts are considered different because their keys differ. If we change the input to make the keys the same:

[
  {
    "a": 1,
    "b": 2,
    "c": 3
  },
  {
    "a": 4,
    "b": 5,
    "c": 6
  }
]

and run again with --strictness keys, the dicts are considered the same:

$ python describejson.py --strictness keys < input.json
1 list of length 2. Values:
  2 dicts of length 3. Values:
    3 ints

but if we use --strictness equal, the dicts will be considered different:

$ python describejson.py --strictness equal < input.json
1 list of length 2. Values:
  1 dict of length 3. Values:
    3 ints
  1 dict of length 3. Values:
    3 ints

Finally, making the dicts the same:

[
  {
    "a": 1,
    "b": 2,
    "c": 3
  },
  {
    "a": 1,
    "b": 2,
    "c": 3
  }
]

and running with --strictness equal will collapse the summary as you’d expect:

$ python describejson.py --strictness equal < input.json
1 list of length 2. Values:
  2 dicts of length 3. Values:
    3 ints

Hopefully it’s clear that by being less strict on matching you’ll get more concise output in which things are casually considered “the same” and if you’re more strict you’ll get more verbose output, all the way to using strict equality for both lists and dicts.

Here’s the full set of --strictness options:

  • type: compare things by type only.
  • length: compare lists and objects by length.
  • keys: compare lists by equality, dicts by keys.
  • equal: compare lists and dicts by equality.

Improvements

The naming of the --strictness options could be improved. The keys option should probably be called values (but that is confusing, since dictionaries have values and it’s a comparison based on their keys!). A values option should probably also compare the value of primitive things, like integers and strings.

There are quite a few other things I might do to this script, if I ever have time. It would be helpful to print out some keys and values when these are short and unchanging. It would be good to show an example representative value of something that repeats (modulo strictness) many times. It might be good to be able to limit the depth to go into a JSON structure.

Overall though, I already find the script useful and I’m not in a rush to “improve” it by adding features. You can though :-)

You might also find it helpful to take what you learn about a JSON object via describe JSON and use that to grep out specific pieces of the structure using jsongrep.py.

If you’re curious, here’s the 24-line output summary of the 24M JSON I received. Much more concise than the nearly 1/2 a million lines from python -m json.tool:

1 dict of length 3. Values:
  1 int
  1 dict of length 4. Values:
    1 list of length 17993. Values:
      17993 dicts of length 5. Values:
        1 unicode
        1 int
        1 list of length 0.
        2 unicodes
    1 list of length 0.
    1 list of length 11907. Values:
      11907 dicts of length 5. Values:
        1 unicode
        1 int
        1 list of length 1. Values:
          1 unicode
        2 unicodes
    1 list of length 28068. Values:
      28068 dicts of length 5. Values:
        1 unicode
        1 int
        1 list of length 0.
        2 unicodes
  1 unicode

Autovivification in Python: nested defaultdicts with a specific final type

00:10 May 26th, 2012 by terry. Posted under python, tech. | 11 Comments »

I quite often miss the flexibility of autovivification in Python. That Wikipedia link shows a cute way to get what Perl has:

from collections import defaultdict

def tree():
    return defaultdict(tree)

lupin = tree()
lupin["express"][3] = "stand and deliver"

It’s interesting to think about what’s going on in the above code and why it works. I really like defaultdict.

What I more often want, though, is not infinitely deep nested dictionaries like the above, but a (known) finite number of nested defaultdicts with a specific type at the final level. Here’s a tiny function I wrote to get just that:

from collections import defaultdict

def autovivify(levels=1, final=dict):
    return (defaultdict(final) if levels < 2 else
            defaultdict(lambda: autovivify(levels - 1, final)))

So let's say you were counting the number of occurrences of words said by people by year, month, and day in a chat room. You'd write:

words = autovivify(5, int)

words["sam"][2012][5][25]["hello"] += 1
words["sue"][2012][5][24]["today"] += 1

Etc. It's pretty trivial, but it was fun to think about how to do it with a function and to play with some alternatives. You could do it manually with nested lambdas:

words = defaultdict(lambda: defaultdict(int))
words["sam"]["hello"] += 1

But that gets messy quickly and isn't nearly as much fun.


The love of pleasure and the love of action

13:03 April 28th, 2012 by terry. Posted under books, Gibbon. | Comments Off on The love of pleasure and the love of action

There are two very natural propensities which we may distinguish in the most virtuous and liberal dispositions, the love of pleasure and the love of action. If the former is refined by art and learning, improved by the charms of social intercourse, and corrected by a just regard to economy, to health, and to reputation, it is productive of the greatest part of the happiness of private life. The love of action is a principle of a much stronger and more doubtful nature. It often leads to anger, to ambition, and to revenge; but when it is guided by the sense of propriety and benevolence, it becomes the parent of every virtue, and if those virtues are accompanied with equal abilities, a family, a state, or an empire, may be indebted for their safety and prosperity to the undaunted courage of a single man. To the love of pleasure we may therefore ascribe most of the agreeable, to the love of action we may attribute most of the useful and respectable, qualifications. The character in which both the one and the other should be united and harmonized, would seem to constitute the most perfect idea of human nature. The insensible and inactive disposition, which should be supposed alike destitute of both, would be rejected, by the common consent of mankind, as utterly incapable of procuring any happiness to the individual, or any public benefit to the world. But it was not in this world, that the primitive Christians were desirous of making themselves either agreeable or useful.

Edward Gibbon
From Chapter XV: Progress Of The Christian Religion.
http://ancienthistory.about.com/library/bl/bl_text_gibbon_1_15_5.htm


A more melancholy duty is imposed on the historian

16:09 April 8th, 2012 by terry. Posted under books, Gibbon. | Comments Off on A more melancholy duty is imposed on the historian

The theologian may indulge the pleasing task of describing Religion as she descended from Heaven, arrayed in her native purity. A more melancholy duty is imposed on the historian. He must discover the inevitable mixture of error and corruption, which she contracted in a long residence upon earth, among a weak and degenerate race of beings.

From Gibbon vol 1 ch 15.


The ruling passion of his soul

07:52 February 28th, 2012 by terry. Posted under books, Gibbon. | Comments Off on The ruling passion of his soul

Yet Commodus was not, as he has been represented, a tiger born with an insatiate thirst of human blood, and capable, from his infancy, of the most inhuman actions. Nature had formed him of a weak rather than a wicked disposition. His simplicity and timidity rendered him the slave of his attendants, who gradually corrupted his mind. His cruelty, which at first obeyed the dictates of others, degenerated into habit, and at length became the ruling passion of his soul.

Gibbon.


It is almost superfluous to enumerate the unworthy successors of Augustus

07:42 February 7th, 2012 by terry. Posted under books, Gibbon. | Comments Off on It is almost superfluous to enumerate the unworthy successors of Augustus

It is almost superfluous to enumerate the unworthy successors of Augustus. Their unparalleled vices, and the splendid theatre on which they were acted, have saved them from oblivion. The dark, unrelenting Tiberius, the furious Caligula, the feeble Claudius, the profligate and cruel Nero, the beastly Vitellius, and the timid, inhuman Domitian, are condemned to everlasting infamy. During fourscore years (excepting only the short and doubtful respite of Vespasian’s reign) Rome groaned beneath an unremitting tyranny, which exterminated the ancient families of the republic, and was fatal to almost every virtue and every talent that arose in that unhappy period.

From Gibbon, Decline & Fall of the Roman Empire (vol 1).


Women’s guide to HTTP status codes for dealing with unwanted geek advances

07:18 January 21st, 2012 by terry. Posted under tech. | 1 Comment »

Here’s a women’s guide to the most useful HTTP status codes for dealing with unwanted geek advances. Someone should make and sell a deck of these.


301 MOVED PERMANENTLY
303 SEE OTHER
305 USE PROXY
306 RESERVED
307 TEMPORARY REDIRECT
400 BAD REQUEST
401 UNAUTHORIZED
402 PAYMENT REQUIRED
403 FORBIDDEN
404 NOT FOUND
405 METHOD NOT ALLOWED
406 NOT ACCEPTABLE
407 PROXY AUTHENTICATION REQUIRED
408 REQUEST TIMEOUT
409 CONFLICT
410 GONE
411 LENGTH REQUIRED
412 PRECONDITION FAILED
413 REQUEST ENTITY TOO LARGE
414 REQUEST-URI TOO LONG
415 UNSUPPORTED MEDIA TYPE
416 REQUESTED RANGE NOT SATISFIABLE
417 EXPECTATION FAILED
422 UNPROCESSABLE ENTITY
423 LOCKED
424 FAILED DEPENDENCY
426 UPGRADE REQUIRED
500 INTERNAL SERVER ERROR
501 NOT IMPLEMENTED
502 BAD GATEWAY
503 SERVICE UNAVAILABLE
504 GATEWAY TIMEOUT
507 INSUFFICIENT STORAGE
508 LOOP DETECTED
510 NOT EXTENDED

Source: Hypertext Transfer Protocol (HTTP) Status Code Registry


Emacs buffer mode histogram

12:45 November 10th, 2011 by terry. Posted under me, other, python, tech. | 6 Comments »

Tonight I noticed that I had over 200 buffers open in emacs. I’ve been programming a lot in Python recently, so many of them are in Python mode. I wondered how many Python files I had open, and I counted them by hand. About 90. I then wondered how many were in Javascript mode, in RST mode, etc. I wondered what a histogram would look like, for me and for others, at times when I’m programming versus working on documentation, etc.

Because it’s emacs, it wasn’t hard to write a function to display a buffer mode histogram. Here’s mine:

235 buffers open, in 23 distinct modes

91               python +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
47          fundamental +++++++++++++++++++++++++++++++++++++++++++++++
24                  js2 ++++++++++++++++++++++++
21                dired +++++++++++++++++++++
16                 html ++++++++++++++++
 7                 text +++++++
 4                 help ++++
 4           emacs-lisp ++++
 3                   sh +++
 3       makefile-gmake +++
 2          compilation ++
 2                  css ++
 1          Buffer-menu +
 1                 mail +
 1                 grep +
 1      completion-list +
 1                   vm +
 1                  org +
 1               comint +
 1              apropos +
 1                 Info +
 1           vm-summary +
 1      vm-presentation +

Tempting as it is, I’m not going to go on about the heady delights of having a fully programmable editor. You either already know, or you can just drool in slack-jawed wonder.

Unfortunately I’m a terrible emacs lisp programmer. I can barely remember a thing each time I use it. But the interpreter is of course just emacs itself and the elisp documentation is in emacs, so it’s a really fun environment to develop in. And because emacs lisp has a ton of support for doing things to itself, code that acts on emacs and your own editing session or buffers is often very succinct. See for example the save-excursion and with-output-to-temp-buffer functions below.

(defun buffer-mode-histogram ()
  "Display a histogram of emacs buffer modes."
  (interactive)
  (let* ((totals '())
         (buffers (buffer-list()))
         (total-buffers (length buffers))
         (ht (make-hash-table :test 'equal)))
    (save-excursion
      (dolist (buffer buffers)
        (set-buffer buffer)
        (let
            ((mode-name (symbol-name major-mode)))
          (puthash mode-name (1+ (gethash mode-name ht 0)) ht))))
    (maphash (lambda (key value)
               (setq totals (cons (list key value) totals)))
             ht)
    (setq totals (sort totals (lambda (x y) (> (cadr x) (cadr y)))))
    (with-output-to-temp-buffer "Buffer mode histogram"
      (princ (format "%d buffers open, in %d distinct modes\n\n"
                      total-buffers (length totals)))
      (dolist (item totals)
        (let
            ((key (car item))
             (count (cadr item)))
          (if (equal (substring key -5) "-mode")
              (setq key (substring key 0 -5)))
          (princ (format "%2d %20s %s\n" count key
                         (make-string count ?+))))))))

Various things about the formatting could be improved. E.g., not use fixed-width fields for the count and the mode names, and make the + signs indicate more than one buffer mode when there are many.


The Grapes of Wrath & Occupy Wall Street

17:37 October 31st, 2011 by terry. Posted under books, me, politics. | 28 Comments »

I’m reading The Grapes of Wrath for the first time. I can’t believe it took me so long to finally read it. It’s great.

Below is a section I just ran across that I imagine will resonate strongly with the people involved in Occupy Wall Street. I’ve long been fascinated to watch how power tries to maintain itself by attempting to enforce isolation and to restrict information flow, and, on the contrary, how increased information flow between the subjects of power naturally undermines this basis. Awareness of these opposing forces, even if not explicitly understood, is what I think accounts for the tenacity and ferocity on both sides of the OWS (and many other) movements, even (especially) when the movements are still only tiny. The occupiers experience the surge of energy and determination and self-identification that comes from solidarity, while those in power recognize the danger and act in heavy-handed ways to try to crush it, usually after trying to ignore and then ridicule. The consistent characteristic of the reaction against these movements, as Steinbeck notes, is that those in power do not understand what’s going on. So in their efforts to snuff out the protests they instead fan the flames, which they then have to react even more violently to. It seems an extraordinarily difficult task for power to successfully manage to defuse a popular movement without resorting to extremes. Hence the absurd justifications of needing to clean (often already cleaned – by the protesters) public spaces, to make the public spaces once again available to the public, etc. Disperse, ridicule, isolate. If the gentle pretenses do not work, then we’ll do what we can to get rid of or evade the media (in all its forms), and then come in and beat the shit out of you.

So for all those out there in the OWS camps around the world (don’t forget there were protests in almost one thousand cities worldwide), and especially for those in the US, here’s some beautiful Steinbeck:

One man, one family driven from the land; this rusty car creaking along the highway to the West. I lost my land, a single tractor took my land. I’m alone and I am bewildered. In the night one family camps in a ditch and other family pulls in and the tents come out. The two men squat on their hams and the women and children listen. Here’s the node, you who hate change and fear revolution. Keep these two squatting men apart; make them hate, fear, suspect each other. Here is the anlage of the thing you fear. This is the zygote. For here “I lost my land” is changed; a cell is split and from its splitting grows the thing you hate — “we lost our land.” The danger is here, for two men are not as lonely and perplexed as one. And from his first “we” there grows a still more dangerous thing; “I have a little food” plus “I have none”. If from this problem the sum is “we have a little food”, the thing is on its way, the movement has direction. Only a little multiplication now, and this land, this tractor are ours. The two-men squatting in a ditch, the little fire, the side-meat stewing in a single pot, the silent, stone-eyed women; behind, the children listening with their souls to words their minds do not understand. The night draws down. The baby has a cold. Here, take this blanket. It’s wool. It was my mothers blanket — take it for the baby. This is the thing to bomb. This is the beginning — from “I” to “we”.

If you who own the things people must have could understand this, you might preserve yourself. If you could separate causes from results, if you could know that Paine, Marx, Jefferson, Lenin were results, not causes, you might survive. But that you cannot know. For the quality of owning freezes you forever into “I”, and cuts you off forever from the “we”.

The Western states are nervous under the beginning change. Need is the stimulus to concept, concept to action. A half-million people moving over the country; one million more restive, ready to move; 10 million more feeling the first nervousness.

And tractors turning the multiple furrows in the vacant land.