Do stuff on things, in parallel

18:10 August 5th, 2017 by terry. Posted under me. | 2 Comments »

Most people don’t really know how to use the tools they spend the most time using: the shell and their editor. It’s worth stopping once in a while to teach yourself new things. Sometimes you find something so powerful that it has a huge impact, in which case it’s even worth stopping other people so they can maybe learn it too.

GNU parallel is such a tool. Actually, I think it’s the most powerful shell utility I’ve ever used. I heard about it when it first came out (in 2011), read the manual, thought “amazing!” but immediately went back to using xargs (I thought I was too busy).

Often I find I need to be exposed to these tools multiple times before I really pick them up and use them regularly. Try to find one thing that the tool is useful for and start using it just for that. But keep in mind what else it can do and gradually expand your usage. Read the manual repeatedly (over the years). Write down a few options or command lines and stick the piece of paper near your monitor. Or just go to talks by geeky friends.

Things

You very often need to take some action on a collection of things. E.g., list a set of files.

How do we get the list of things to operate on? The shell has always provided some help, and there are various standalone tools that can help.

Globbing

The shell gave you globbing

# List all files ending in .c
$ ls *.c

# List all files ending in .c or .h
$ ls *.[ch]

and glob made its way into programming languages

# Python
from glob import glob
print(glob('*.c'))

Making things up: echo

echo is extremely useful. It’s like print in a typical programming language.

$ echo a b c

Making things up: brace expansion

In bash (and some other shells), brace expansion is very useful

$ echo {chair,stool,table}.c
chair.c stool.c table.c

Note the important difference from globbing: the things don’t have to already exist as files or directories. Brace expansion is just creating strings while globbing is expanding patterns into the names of pre-existing things (file and directory names).
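
To see the globbing half of that difference concretely, here is a minimal sketch run in a scratch directory (the directory and the file name are made up for illustration). It assumes bash's default settings or a POSIX sh, where a pattern that matches nothing is left as the literal pattern; zsh, for one, reports an error instead.

```shell
# Work in a throwaway directory so nothing real is touched.
scratch=$(mktemp -d)

# No .c files exist yet, so the pattern is passed through unexpanded.
before=$(cd "$scratch" && echo *.c)

# Create one file; now the glob expands to its name.
touch "$scratch/chair.c"
after=$(cd "$scratch" && echo *.c)

echo "before: $before"
echo "after:  $after"
rm -r "$scratch"
```

A glob never invents names: it either produces names of files that exist or, as in the first case, is left untouched.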

When used more than once, brace expansion gets you the cross product:

$ echo {chair,stool,table}.{c,h}
chair.c chair.h stool.c stool.h table.c table.h

This is (roughly) just a loop in a loop.
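
Spelled out, the loop in a loop looks something like the sketch below (the variable names are arbitrary). The left-most brace list is the outer loop, so the right-most list varies fastest.

```shell
# Outer loop over the first brace list, inner loop over the second.
result=""
for name in chair stool table
do
    for ext in c h
    do
        result="$result $name.$ext"
    done
done

# Drop the leading space left by the first append.
result=${result# }
echo "$result"
```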

Command expansion

If you put a command in $(...), the shell runs the command and replaces the whole expression with the output of the command (the shell manuals call this command substitution). So this

$ wc -l $(grep -l hello *.c)

runs wc -l on the .c files that contain hello.
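
A small sketch of the same idea on made-up files in a scratch directory. It also guards against one easy-to-miss case: if grep finds nothing, the substitution is empty and wc, given no file names, waits for input on stdin.

```shell
scratch=$(mktemp -d)
printf 'hello\nworld\n' > "$scratch/a.c"
printf 'goodbye\n'      > "$scratch/b.c"

# grep -l prints only the names of files that contain a match.
matches=$(cd "$scratch" && grep -l hello *.c)
echo "files containing hello: $matches"

# Only call wc when there is something to count.
if [ -n "$matches" ]; then
    # $matches is deliberately unquoted so each name becomes its own argument.
    lines=$(cd "$scratch" && wc -l $matches)
    echo "$lines"
fi
rm -r "$scratch"
```

The unquoted $matches relies on word splitting, so this sketch shares the usual weakness of the technique: file names containing spaces will break it.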

Finding things

It often becomes awkward to use globbing, so there’s a separate find command that can find things (files or directories).

# Recursively find files whose names end in .c and print their names.
$ find . -type f -name '*.c' -print

You can’t do that with globbing, unless you use

$ ls *.c */*.c */*/*.c */*/*/*.c

which has obvious limitations.

Apart from being able to walk the filesystem, find has many options to only return files (or directories) with certain properties.

For example, here we find files whose names contain ‘abc’, with a size of over 1MB and that have been modified in the last 2 weeks:

$ find . -type f -name '*abc*' -mtime -14 -size +1M

Warning: find is also a bit cryptic. I still don’t really understand it, after over 30 years!
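
Here is a self-contained sketch you can run to watch find work, using a made-up directory tree under a scratch directory. It spells the tests as -mtime -14 and -size +1M, which both GNU and BSD find should accept.

```shell
scratch=$(mktemp -d)
mkdir -p "$scratch/src/deep"
touch "$scratch/top.c" "$scratch/src/mid.c" "$scratch/src/deep/low.c" "$scratch/src/notes.txt"

# A 2 MiB file, to give -size something to match.
dd if=/dev/zero of="$scratch/src/big-abc.bin" bs=1024 count=2048 2>/dev/null

# Recursive search by name; sorted because find's output order is not guaranteed.
c_files=$(cd "$scratch" && find . -type f -name '*.c' | sort)
echo "$c_files"

# Name contains abc, larger than 1 MiB, modified within the last 14 days.
big_recent=$(cd "$scratch" && find . -type f -name '*abc*' -size +1M -mtime -14)
echo "$big_recent"
rm -r "$scratch"
```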

Altering things

It’s very common that you need to change the names of things slightly. There are many standard UNIX tools that can help: tr, cut, basename, dirname, sed, awk, perl. It’s worth learning very basic usage of these things (especially tr and cut). See next sections for some simple examples.
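
As a quick taste, here is a sketch applying a few of these tools to one path (the path itself is just an example).

```shell
path=/home/pete/main.c

dir=$(dirname "$path")                                 # directory part
file=$(basename "$path")                               # last path component
stem=$(basename "$path" .c)                            # last component, minus a .c suffix
first=$(echo "$file" | cut -f1 -d.)                    # text before the first dot
upper=$(echo "$file" | tr '[:lower:]' '[:upper:]')     # upper-cased
header=$(echo "$file" | sed -e 's/\.c$/.h/')           # trailing .c swapped for .h

echo "$dir $file $stem $first $upper $header"
```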

Do stuff on things

Shell variables and loops

Many people do not realize that the shell is a programming language. It has variables and loops, which you can use to build up a list of things:

$ for year in 2015 2016 2017
  do
    for name in sally jack sue
    do
        mkdir -p $year/$name
    done
  done

The above could be done with brace expansion:

$ mkdir -p {2015,2016,2017}/{sally,jack,sue}

Often you use the value of a variable to make a file with a related name. Use command expansion $(...) plus one of the above altering tools to make the new name:

$ for file in *.c
  do
    base=$(echo $file | cut -f1 -d.)
    wc -l < $file > $base.line-count
  done
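
A variation on that loop, as a sketch rather than a recommendation: the shell's own ${file%.c} expansion strips the suffix without starting cut, and it still works when a directory name earlier in the path contains a dot (cut -f1 -d. would truncate at the first dot). The scratch directory and main.c are made up for the demonstration.

```shell
scratch=$(mktemp -d)
printf 'int main(void)\n{\n    return 0;\n}\n' > "$scratch/main.c"

for file in "$scratch"/*.c
do
    # ${file%.c} removes a trailing .c from the value of $file.
    base=${file%.c}
    # Quoting the names keeps a space in a file name from splitting the word.
    wc -l < "$file" > "$base.line-count"
done

# wc pads its output on some systems, so strip the spaces before printing.
count=$(tr -d ' ' < "$scratch/main.line-count")
echo "main.c has $count lines"
rm -r "$scratch"
```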

Using exec in find

Find offered its own limited way to run commands on things it found. E.g.,

# Find .c files and run wc on each of them.
$ find . -name '*.c' -type f -exec wc {} \;

This was quite limited and it results in wc being run once for each file, which is much slower.
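
Current versions of find (and POSIX) also accept + in place of \; to batch many names into one invocation. The sketch below uses echo as a stand-in for wc so the number of invocations can be counted: each invocation prints exactly one line. The file names are made up.

```shell
scratch=$(mktemp -d)
touch "$scratch/a.c" "$scratch/b.c" "$scratch/c.c"

# With \; the command runs once per file: three echo calls, three lines.
per_file=$(find "$scratch" -type f -name '*.c' -exec echo {} \; | wc -l | tr -d ' ')

# With + the names are batched: one echo call, one line.
batched=$(find "$scratch" -type f -name '*.c' -exec echo {} + | wc -l | tr -d ' ')

echo "per file: $per_file invocations, batched: $batched invocation"
rm -r "$scratch"
```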

Enter xargs

If you’ve never learned xargs, don’t bother with it, just skip to parallel (see next section).

To complement find, xargs came along, reading a list of things from standard input:

$ find . -name '*.c' | xargs wc

By default, xargs takes all the names on standard input (splitting on whitespace and newline) and puts them at the end of the command you give it.

A big difference in the above is that wc is (normally) only run once.

xargs can also make sure that the command line isn’t too long (it will invoke wc more than once if so) and can be told to only give a certain number of things to a command.

This will give 5 things at a time to wc:

$ find . -name '*.c' | xargs -n 5 wc

You can also tell xargs where to put the things in the command

$ mkdir /tmp/c-files
$ find . -name '*.c' | xargs -I+ mv + /tmp/c-files

$ echo a b c d e | xargs -n 2 -I+ echo mv + /tmp
mv a b /tmp
mv c d /tmp
mv e /tmp

Note that GNU xargs treats -n and -I as mutually exclusive: the later option wins (with a warning), so on Linux this prints a single line, mv a b c d e /tmp.

xargs will run into problems if argument names (usually file names) have spaces or newlines in them. So find and xargs can use the same convention to NUL-separate names:

$ find . -name '*.c' -print0 | xargs -0 wc

This is the accepted / standard safe way to use find & xargs.
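
A sketch of what goes wrong, and how the NUL convention fixes it, using two made-up files, one of which has a space in its name. LC_ALL=C is only there so the error message is in English for grep to count.

```shell
scratch=$(mktemp -d)
printf 'one\n'      > "$scratch/plain.c"
printf 'one\ntwo\n' > "$scratch/has space.c"

# NUL-separated names survive the space: wc reports on two real files.
safe=$(cd "$scratch" && find . -name '*.c' -print0 | xargs -0 wc -l | grep -c '\.c$')

# Whitespace-split names break: "./has" and "space.c" do not exist.
unsafe_errors=$(cd "$scratch" && find . -name '*.c' | LC_ALL=C xargs wc -l 2>&1 | grep -c 'No such file')

echo "files counted with -print0/-0: $safe"
echo "errors without them: $unsafe_errors"
rm -r "$scratch"
```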

Do stuff on things, in parallel

GNU parallel

This was all a bit shit. It was hacky, there were exceptions, there were limitations, there were conflicting versions of programs (e.g., OS X xargs is crappy compared to the Linux version). You could do lots of stuff, and it felt powerful, but you’d often end up writing a shell script if you had to do something slightly different (like make a new file whose name was based on a simple transformation of another file’s name):

$ for file in *.c
  do
    base=$(echo $file | cut -f1 -d.)
    wc -l < $file > $base.line-count
  done

And, when you had your loops and your find and xargs all just so, your commands were still executed one by one. So there you are, on a machine with 8 cores but you’re only using one of them. It’s no big deal if your command is trivial, but if it takes an hour, you might be looking at an 8 hour wait instead of a 1 hour one.

This is not so easily solved. You could do something like this:

$ for file in *.c
  do
    wc $file &
  done

but that runs all your commands at once, with no regard for how many cores you actually have. That can be even worse than just running one command after another. What you in fact want is one command running on each core, with a queue of pending commands that are started as cores become free.
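
For comparison, here is a rough sketch of that "one job per core, with a queue" idea using only xargs, whose -P option (in both the GNU and BSD versions) caps the number of simultaneous processes. This isn't the approach the rest of this post takes. The eight numbered jobs and the sleep standing in for real work are made up, and getconf _NPROCESSORS_ONLN is assumed to report the core count (it does on Linux and OS X).

```shell
# Number of online cores; fall back to 2 if getconf cannot tell us.
cores=$(getconf _NPROCESSORS_ONLN 2>/dev/null || echo 2)

# Eight pretend jobs. xargs keeps at most $cores of them running and
# starts the next one as soon as a slot frees up.
finished=$(printf '%s\n' 1 2 3 4 5 6 7 8 |
    xargs -n 1 -P "$cores" sh -c 'sleep 0.1; echo "job $1 done"' sh |
    wc -l | tr -d ' ')

echo "$finished jobs finished, at most $cores at a time"
```

It works, but output from different jobs can interleave, and you are back to sh -c tricks as soon as the command needs the name anywhere other than at the end.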

GNU parallel solves all these problems. It gives you looping, can read input in words or complete lines, has powerful ways to use and manipulate the names it is given, does things in parallel (but can still order its output to match the input). It can even send jobs to remote machines.

Let’s have a look.

Emulating xargs

You can use parallel in place of xargs. Here’s the setup:

$ mkdir /tmp/test
$ cd /tmp/test
$ touch a b c
$ ls -l
total 0
-rw-r--r--  1 terry  wheel  0 Aug  5 17:09 a
-rw-r--r--  1 terry  wheel  0 Aug  5 17:09 b
-rw-r--r--  1 terry  wheel  0 Aug  5 17:09 c

The following passes each file name to echo individually:

$ ls | parallel echo
a
b
c

Whereas this collects the multiple names and passes them all to one invocation of echo:

$ ls | parallel --xargs echo
a b c

Note that this is subtly different from the following:

$ echo * | parallel echo
a b c

$ echo * | parallel --xargs echo
a b c

That’s because ls will write one filename per line of output when it detects that its stdout is not a terminal (contrast what you get when you run $ ls with ls | cat). On the other hand, echo * writes just one line of output.

In both the latter (echo) examples, parallel is just getting one line of input and is giving that line to echo. In the former case (ls) it gets multiple lines of input and the --xargs option tells it to collect those lines and put them on the command line to echo.
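
You can check the line counts yourself without parallel at all. A small sketch, recreating the three empty files in a scratch directory:

```shell
scratch=$(mktemp -d)
touch "$scratch/a" "$scratch/b" "$scratch/c"

# ls writes one name per line when its output is a pipe.
ls_lines=$(cd "$scratch" && ls | wc -l | tr -d ' ')

# echo writes every name on a single line.
echo_lines=$(cd "$scratch" && echo * | wc -l | tr -d ' ')

echo "ls: $ls_lines lines, echo: $echo_lines line"
rm -r "$scratch"
```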

Note that I don’t understand why parallel -m and parallel -X don’t also collect input lines in the way --xargs does. The manual page for parallel seems to indicate that they should.

Sending commands into parallel

$ for year in 2015 2016 2017
  do
    for name in sally jack sue
    do
        echo mkdir -p $year/$name
    done
  done | parallel

parallel can make loops for you

Here’s a cross product loop, just like the above:

$ parallel echo mkdir -p '{1}/{2}' ::: 2015 2016 2017 ::: sally jack sue

Output ordering

Because processes may not finish in the order they’re started:

$ parallel echo ::: $(seq 1 10)
7
8
9
6
5
4
10
3
2
1

there’s a -k option to make sure the output order matches the input:

$ parallel -k echo ::: $(seq 1 10)
1
2
3
4
5
6
7
8
9
10
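
To make the effect of -k easier to see, here is a sketch where the jobs are rigged to finish in reverse order: each one sleeps for the number of seconds it is given, so the last input finishes first. The have_gnu_parallel helper is made up for this example; it just skips the demonstration if GNU parallel isn't installed (some systems ship an unrelated program called parallel).

```shell
# Hypothetical helper: true only if the parallel on PATH is GNU parallel.
have_gnu_parallel() {
    command -v parallel >/dev/null 2>&1 &&
        parallel --version 2>/dev/null | grep -q 'GNU parallel'
}

if have_gnu_parallel; then
    # The 0.1 job finishes first, but -k holds its output back until
    # the jobs for the earlier inputs have been printed.
    ordered=$(parallel -k 'sleep {}; echo {}' ::: 0.3 0.2 0.1)
else
    ordered="GNU parallel not found"
fi
echo "$ordered"
```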

Reading the names of things from files

$ cat names
sally
jack
sue

$ parallel echo mkdir -p '{1}/{2}' ::: 2015 2016 2017 :::: names

And

$ cat years
2015
2016
2017

$ parallel echo mkdir -p '{1}/{2}' :::: years :::: names

Or read from standard input, from the command line, and from a file (note that reading standard input needs :::: -, with four colons):

$ ls *.c | parallel echo '{1} {2} {3}' :::: - ::: 2015 2016 2017 :::: names

Combining input names

Sometimes you don’t want a cross product, you want to combine names (like using zip(...) in Python or (mapcar #'list ...) in lisp to combine multiple lists). Compare

$ parallel echo '{1} {2}' ::: 2015 2016 2017 ::: goat monkey rooster
2015 goat
2016 goat
2016 monkey
2015 rooster
2015 monkey
2016 rooster
2017 goat
2017 monkey
2017 rooster

with

$ parallel echo '{1} {2}' ::: 2015 2016 2017 :::+ goat monkey rooster
2015 goat
2016 monkey
2017 rooster

Modifying names

Parallel has a bunch of ways to edit names. So instead of needing to write a script like this:

$ for file in *.c
  do
    base=$(echo $file | cut -f1 -d.)
    wc -l < $file > $base.line-count
  done

you can just do this:

$ parallel 'wc -l < {} > {.}.line-count' ::: *.c

There are lots of ways to modify input names, including:

{} The name (i.e., the input line), unmodified
{.} Input line without extension.
{/} Basename of input line. E.g., /home/pete/main.c becomes main.c.
{//} Dirname of input line. E.g., /home/pete/main.c becomes /home/pete.
{/.} Basename of input line without extension. E.g., /home/pete/main.c becomes main.

And the --plus option gives you more, like {..} to remove two dotted suffixes.

You can modify input names individually:

$ parallel echo '{1/} {2.}' ::: data/2015 data/2016 data/2017 ::: sally.c jack.c sue.c
2015 sally
2015 jack
2015 sue
2016 sally
2016 jack
2016 sue
2017 sally
2017 jack
2017 sue
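
A safe way to experiment with these replacement strings is --dry-run, which prints the commands parallel would run instead of running them. A sketch, with made-up file names that don't need to exist; have_gnu_parallel is a hypothetical helper that skips the demonstration when GNU parallel isn't installed.

```shell
# Hypothetical helper: true only if the parallel on PATH is GNU parallel.
have_gnu_parallel() {
    command -v parallel >/dev/null 2>&1 &&
        parallel --version 2>/dev/null | grep -q 'GNU parallel'
}

if have_gnu_parallel; then
    # Show the commands that {} and {.} expand into, without running them.
    planned=$(parallel -k --dry-run 'wc -l < {} > {.}.line-count' ::: chair.c stool.c)
    # Basename, dirname, and basename without extension of one path.
    parts=$(parallel echo '{/} {//} {/.}' ::: /home/pete/main.c)
else
    planned="GNU parallel not found"
    parts="GNU parallel not found"
fi
echo "$planned"
echo "$parts"
```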

Running on remote machines

The following transfers all *.c files to a remote machine ac, runs wc -l on them (one by one), puts the output into a file that has the .c replaced by .out, returns all the output files to my local machine, and cleans up the files created on the remote:

$ parallel -S ac --transferfile '{}' --return '{.}.out' --cleanup wc -l '{}' \> '{.}'.out ::: *.c

The option combination --transferfile '{}' --return '{.}.out' --cleanup is so common you can abbreviate it to --trc '{.}.out'.

$ parallel -S ac --trc '{.}.out' wc -l '{}' \> '{.}'.out ::: *.c

A real life example

Here’s a script I wrote

sample=`/bin/pwd | tr / '\012' | egrep 'DA[0-9]+'`

# Collect all read ids, with > replaced by @
cat 03-panel/out/[0-9]*.fasta | egrep '^>' | sed -e 's/^>/@/' > read-ids
count=1

for dir in ../../2016*/Sample_ESW_*${sample}_*
do
    fastq=
    for file in $dir/03-find-unmapped/*-unmapped.fastq.gz
    do
        fastq="$fastq $file"
    done

    # Pull the FASTQ out for the read ids.
    zcat $fastq | fgrep -f read-ids -A 3 | egrep -v -e '^--$' | gzip > run-$count.fastq.gz
    count=`expr $count + 1`
done

I made it faster by running a zcat on each core using parallel:

# Some lines omitted
for dir in ../../2016*/Sample_ESW_*${sample}_*
do
    fastq=
    for file in $dir/03-find-unmapped/*-unmapped.fastq.gz
    do
        fastq="$fastq $file"
    done

    ls $fastq | parallel "(zcat {} | fgrep -f read-ids -A 3 | egrep -v -e '^--$')" | gzip > run-$count.fastq.gz
done

But that’s still inefficient because my main loop waits until each dir is completely processed (i.e., all its fastq files have been run). So as the last fastq files are being processed, cores are unused.

So, faster:

# Some lines omitted
for dir in ../../2016*/Sample_ESW_*${sample}_*
do
    fastq=
    for file in $dir/03-find-unmapped/*-unmapped.fastq.gz
    do
        fastq="$fastq $file"
    done

    echo "zcat $fastq | fgrep -f read-ids -A 3 | egrep -v -e '^--\$' | gzip > run-$count.fastq.gz"
done | parallel

This uses the fact that parallel will treat its input lines as commands to run (in parallel) if it’s not given an explicit command to run.
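
The same trick in miniature, as a sketch with harmless echo commands standing in for the real pipeline. The have_gnu_parallel helper is made up for this example and skips the demonstration when GNU parallel isn't installed. The -k is only there to make the output order predictable.

```shell
# Hypothetical helper: true only if the parallel on PATH is GNU parallel.
have_gnu_parallel() {
    command -v parallel >/dev/null 2>&1 &&
        parallel --version 2>/dev/null | grep -q 'GNU parallel'
}

if have_gnu_parallel; then
    # The loop only prints command lines; parallel, given no command
    # of its own, runs each input line as a command.
    ran=$(for name in sally jack sue
          do
              echo "echo hello $name"
          done | parallel -k)
else
    ran="GNU parallel not found"
fi
echo "$ran"
```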

And this could have been made faster by using parallel to run the zcat.

Counting sequences in FASTQ

I wanted to count the number of nucleotide sequences (billions of them) spread over nearly 6000 FASTQ files (found under directories that start with 20):

$ find 20* -maxdepth 2 -name '*.fastq.gz' | parallel --plus --bar "zcat {} | egrep -c '^\\+\$' > {..}.read-count"

The --bar gives a cool progress bar. The --plus makes {..} work (to remove two suffixes).
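
The counting itself rests on the FASTQ layout: each read takes four lines, and the third is a + separator. Here is a sketch on a tiny made-up file, without parallel, that cross-checks the count against the total line count divided by four. It assumes the separator line is a bare + (the format also allows the read id to be repeated after it, which '^+$' would miss) and that no quality line is exactly +.

```shell
scratch=$(mktemp -d)

# Two made-up reads in the four-line FASTQ layout: id, sequence, +, quality.
printf '@read1\nACGT\n+\nIIII\n@read2\nGGCC\n+\nIIII\n' | gzip > "$scratch/sample.fastq.gz"

# Count the separator lines, one per read.
by_plus=$(gzip -dc "$scratch/sample.fastq.gz" | grep -c '^+$')

# Cross-check: total number of lines divided by four.
total=$(gzip -dc "$scratch/sample.fastq.gz" | wc -l | tr -d ' ')
by_lines=$((total / 4))

echo "$by_plus reads by separator, $by_lines reads by line count"
rm -r "$scratch"
```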

And there’s much more

Breaking input up by delimiter instead of by line.
Breaking up input into chunks and passing each chunk to a process that reads from stdin (using --pipe or --pipepart).
Stop launching jobs after one (or a percent) fail. E.g., parallel --halt soon,fail=1
Kill currently running jobs if one fails. E.g., parallel --halt now,fail=1
Resource limiting.
Resuming failed jobs.
Retrying failing commands.
Dry run: --dry-run.
Using tmux to show output.

Installing

On OS X, if you’re using brew:

$ brew install parallel

More info

Run man parallel (see many examples at bottom).
Parallel tutorial.
Wikipedia page.
Some videos.


Jean Jordaan Says:
August 6th, 2017 at 8:19 am

Wow I’ve been ignoring `parallel` too. Small aside: you can set the `globstar` option with `shopt` to enable `**/*.c` in bash, so you don’t have to say things like `*/*/*.c`.

terrycojones Says:
August 11th, 2017 at 4:12 pm

Thanks Jean – I didn’t know about globstar. I don’t use bash any more. I switched to fish a couple of years ago. Maybe fish has a similar thing. Thanks again.
