Do stuff on things, in parallel
Most people don’t really know how to use the tools they spend the most time using: the shell and their editor. It’s worth stopping once in a while to teach yourself new things. Sometimes you find something so powerful that it has a huge impact, in which case it’s even worth stopping other people so they can maybe learn it too.
GNU parallel is such a tool. Actually, I think it's the most powerful shell utility I've ever used. I heard about it when it first came out (in 2011), read the manual, thought "amazing!", but immediately went back to using xargs (I thought I was too busy).
Often I find I need to be exposed to these tools multiple times before I really pick them up and use them regularly. Try to find one thing it's useful for and start using it just for that. But keep in mind what else it can do and gradually expand your usage. Read the manual repeatedly (over the years). Write down a few options or command lines and stick the piece of paper near your monitor. Or just go to talks by geeky friends.
Things
You very often need to take some action on a collection of things. E.g., list a set of files.
How do we get the list of things to operate on? The shell has always provided some help, and there are various standalone tools that can help.
Globbing
The shell gave you globbing:
# List all files ending in .c
$ ls *.c
# List all files ending in .c or .h
$ ls *.[ch]
and glob made its way into programming languages:
# Python
from glob import glob
print(glob('*.c'))
Making things up: echo
echo is extremely useful. It's like print in a typical programming language.
$ echo a b c
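It's also a cheap safety net, used a few times later in this post: stick echo in front of a command you're unsure about and you see what would run without actually running it (the directory name here is made up):
$ echo rm -r build/*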
Making things up: brace expansion
In bash (and some other shells), brace expansion is very useful
$ echo {chair,stool,table}.c
chair.c stool.c table.c
Note the important difference from globbing: the things don’t have to already exist as files or directories. Brace expansion is just creating strings while globbing is expanding patterns into the names of pre-existing things (file and directory names).
When used more than once, brace expansion gets you the cross product:
$ echo {chair,stool,table}.{c,h}
chair.c chair.h stool.c stool.h table.c table.h
This is (roughly) just a loop in a loop.
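To make that concrete, here's the same cross product written as explicit nested loops (just a sketch; brace expansion does this for you before the command even runs, and prints everything on one line):
$ for name in chair stool table
do
for ext in c h
do
echo $name.$ext
done
done
chair.c
chair.h
stool.c
stool.h
table.c
table.h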
Command expansion
If you put a command in $(...), the shell runs the command and replaces the whole expression with the output of the command. So this
$ wc -l $(grep -l hello *.c)
runs wc -l on the .c files that contain hello.
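Command expansion also composes nicely with shell variables. A throwaway example (assuming there are some .c files in the current directory):
$ count=$(ls *.c | wc -l)
$ echo "There are $count .c files here"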
Finding things
It often becomes awkward to use globbing, so there's a separate find command that can find things (files or directories).
# Recursively find files whose names end in .c and print their names.
$ find . -type f -name '*.c' -print
You can’t do that with globbing, unless you use
$ ls *.c */*.c */*/*.c */*/*/*.c
which has obvious limitations.
Apart from being able to walk the filesystem, find
has many options to only return files (or directories) with certain properties.
For example, here we find files whose names contain 'abc', with a size of over 1MB, that have been modified in the last 2 weeks (in GNU find, -mtime is measured in days):
$ find . -type f -name '*abc*' -mtime -14 -size +1M
Warning: find is also a bit cryptic. I still don't really understand it, after over 30 years!
Altering things
It's very common that you need to change the names of things slightly. There are many standard UNIX tools that can help: tr, cut, basename, dirname, sed, awk, perl. It's worth learning very basic usage of these things (especially tr and cut). See the next sections for some simple examples.
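For an immediate flavour of what tr, cut, basename, and dirname do to names (the path below is made up):
$ echo main.c | tr a-z A-Z
MAIN.C
$ echo main.c | cut -f1 -d.
main
$ basename /home/pete/main.c
main.c
$ dirname /home/pete/main.c
/home/pete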
Do stuff on things
Shell variables and loops
Many people do not realize that the shell is a programming language. It has variables and loops, which you can use to build up a list of things:
$ for year in 2015 2016 2017
do
for name in sally jack sue
do
mkdir -p $year/$name
done
done
The above could be done with brace expansion:
$ mkdir -p {2015,2016,2017}/{sally,jack,sue}
Often you use the value of a variable to make a file with a related name. Use command expansion $(...) plus one of the above altering tools to make the new name:
$ for file in *.c
do
base=$(echo $file | cut -f1 -d.)
wc -l < $file > $base.line-count
done
Using exec in find
Find offered its own limited way to run commands on things it found. E.g.,
# Find .c files and run wc on each of them.
$ find . -name '*.c' -type f -exec wc {} \;
This was quite limited, and it results in wc being run once for each file, which is much slower.
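For what it's worth, modern find can batch arguments itself if you end -exec with + instead of \; (one wc invocation for many files), though everything still runs serially:
$ find . -name '*.c' -type f -exec wc {} +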
Enter xargs
If you've never learned xargs, don't bother with it; just skip to parallel (see the next section).
To complement find, xargs came along, reading a list of things from standard input:
$ find . -name '*.c' | xargs wc
By default, xargs takes all the names on standard input (splitting on whitespace and newline) and puts them at the end of the command you give it.
A big difference in the above is that wc is (normally) only run once.
xargs can also make sure that the command line isn't too long (it will invoke wc more than once if so) and can be told to only give a certain number of things to a command.
This will give 5 things at a time to wc:
$ find . -name '*.c' | xargs -n 5 wc
You can also tell xargs where to put the things in the command:
$ mkdir /tmp/c-files
$ find . -name '*.c' | xargs -I+ mv + /tmp/c-files
Or
$ echo a b c d e | xargs -n 2 -I+ echo mv + /tmp
mv a b /tmp
mv c d /tmp
mv e /tmp
xargs will run into problems if argument names (usually file names) have spaces or newlines in them. So find and xargs can use the same convention to NUL-separate names:
$ find . -name '*.c' -print0 | xargs -0 wc
This is the accepted / standard safe way to use find & xargs.
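The parallel command described below understands the same convention via its -0 (a.k.a. --null) option, so the safe pipeline carries over directly (each file gets its own wc here; add --xargs to batch them as xargs would):
$ find . -name '*.c' -print0 | parallel -0 wc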
Do stuff on things, in parallel
GNU parallel
This was all a bit shit. It was hacky, there were exceptions, there were limitations, there were conflicting versions of programs (e.g., OS X xargs is crappy compared to the Linux version). You could do lots of stuff, and it felt powerful, but you'd often end up writing a shell script if you had to do something slightly different (like make a new file whose name was based on a simple transformation of another file's name):
$ for file in *.c
do
base=$(echo $file | cut -f1 -d.)
wc -l < $file > $base.line-count
done
And, when you had your loops and your find and xargs all just so, your commands were still executed one by one. So there you are, on a machine with 8 cores but you're only using one of them. It's no big deal if your command is trivial, but if it takes an hour, you might be looking at an 8-hour wait instead of a 1-hour one.
This is not so easily solved. You could do something like this:
$ for file in *.c
do
wc $file &
done
but that runs all your commands at once, with no regard for how many cores you actually have. That can be even worse than just running one command after another. What you in fact want is one command running on each core, with a queue of pending commands that are started as cores become free.
GNU parallel solves all these problems. It gives you looping, can read input in words or complete lines, has powerful ways to use and manipulate the names it is given, does things in parallel (but can still order its output to match the input). It can even send jobs to remote machines.
Let’s have a look.
Emulating xargs
You can use parallel in place of xargs. Here's the setup:
$ mkdir /tmp/test
$ cd /tmp/test
$ touch a b c
$ ls -l
total 0
-rw-r--r-- 1 terry wheel 0 Aug 5 17:09 a
-rw-r--r-- 1 terry wheel 0 Aug 5 17:09 b
-rw-r--r-- 1 terry wheel 0 Aug 5 17:09 c
The following passes each file name to echo individually:
$ ls | parallel echo
a
b
c
Whereas this collects the multiple names and passes them all to one invocation of echo:
$ ls | parallel --xargs echo
a b c
Note that this is subtly different from the following:
$ echo * | parallel echo
a b c
$ echo * | parallel --xargs echo
a b c
That's because ls writes one filename per line of output when it detects that its stdout is not a terminal (compare the output of plain ls with that of ls | cat). On the other hand, echo * writes just one line of output.
In both the latter (echo) examples, parallel is just getting one line of input and is giving that line to echo. In the former case (ls) it gets multiple lines of input, and the --xargs option tells it to collect those lines and put them on the command line to echo.
Note that I don't understand why parallel -m and parallel -X don't also collect input lines in the way --xargs does. The manual page for parallel seems to indicate that they should.
Sending commands into parallel
$ for year in 2015 2016 2017
do
for name in sally jack sue
do
echo mkdir -p $year/$name
done
done | parallel
parallel can make loops for you
Here’s a cross product loop, just like the above:
$ parallel echo mkdir -p '{1}/{2}' ::: 2015 2016 2017 ::: sally jack sue
Output ordering
Because processes may not finish in the order they’re started:
$ parallel echo ::: $(seq 1 10)
7
8
9
6
5
4
10
3
2
1
there's a -k option to make sure the output order matches the input:
$ parallel -k echo ::: $(seq 1 10)
1
2
3
4
5
6
7
8
9
10
Reading the names of things from files
$ cat names
sally
jack
sue
$ parallel echo mkdir -p '{1}/{2}' ::: 2015 2016 2017 :::: names
And
$ cat years
2015
2016
2017
$ parallel echo mkdir -p '{1}/{2}' :::: years :::: names
Or read from standard input, from a file, and from the command line:
$ ls *.c | parallel echo '{1} {2} {3}' :::: - :::: years ::: sally jack sue
Combining input names
Sometimes you don't want a cross product, you want to combine names (like using zip(...) in Python or (mapcar #'list ...) in Lisp to combine multiple lists). Compare
$ parallel echo '{1} {2}' ::: 2015 2016 2017 ::: goat monkey rooster
2015 goat
2016 goat
2016 monkey
2015 rooster
2015 monkey
2016 rooster
2017 goat
2017 monkey
2017 rooster
with
$ parallel echo '{1} {2}' ::: 2015 2016 2017 :::+ goat monkey rooster
2015 goat
2016 monkey
2017 rooster
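The :::+ syntax only appeared in relatively recent versions of parallel; if yours doesn't have it, the --link option pairs up the input sources in the same way:
$ parallel --link echo '{1} {2}' ::: 2015 2016 2017 ::: goat monkey rooster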
Modifying names
Parallel has a bunch of ways to edit names. So instead of needing to write a script like this:
$ for file in *.c
do
base=$(echo $file | cut -f1 -d.)
wc -l < $file > $base.line-count
done
you can just do this:
$ parallel 'wc -l < {} > {.}.line-count' ::: *.c
There are lots of ways to modify input names, including:
- {} The name (i.e., the input line), unmodified.
- {.} Input line without extension.
- {/} Basename of input line. E.g., /home/pete/main.c becomes main.c.
- {//} Dirname of input line. E.g., /home/pete/main.c becomes /home/pete.
- {/.} Basename of input line without extension. E.g., /home/pete/main.c becomes main.
And the --plus option gives you more, like {..} to remove two dotted suffixes.
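For example (the file name is made up):
$ parallel --plus echo '{..}' ::: archive.tar.gz
archive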
You can modify input names individually:
$ parallel echo '{1/} {2.}' ::: data/2015 data/2016 data/2017 ::: sally.c jack.c sue.c
2015 sally
2015 jack
2015 sue
2016 sally
2016 jack
2016 sue
2017 sally
2017 jack
2017 sue
Running on remote machines
The following transfers all *.c files to a remote machine ac, runs wc -l on them (one by one), puts the output into a file that has the .c replaced by .out, returns all the output files to my local machine, and cleans up the files created on the remote:
$ parallel -S ac --transferfile '{}' --return '{.}.out' --cleanup wc -l '{}' \> '{.}'.out ::: *.c
The option combination --transferfile '{}' --return '{.}.out' --cleanup is so common you can abbreviate it to --trc '{.}.out'.
$ parallel -S ac --trc '{.}.out' wc -l '{}' \> '{.}'.out ::: *.c
A real life example
Here's a script I wrote:
# Work out the sample name (something matching DA<digits>) from the current working directory.
sample=`/bin/pwd | tr / '\012' | egrep 'DA[0-9]+'`
# Collect all read ids, with > replaced by @
cat 03-panel/out/[0-9]*.fasta | egrep '^>' | sed -e 's/^>/@/' > read-ids
count=1
for dir in ../../2016*/Sample_ESW_*${sample}_*
do
fastq=
for file in $dir/03-find-unmapped/*-unmapped.fastq.gz
do
fastq="$fastq $file"
done
# Pull the FASTQ out for the read ids.
zcat $fastq | fgrep -f read-ids -A 3 | egrep -v -e '^--$' | gzip > run-$count.fastq.gz
count=`expr $count + 1`
done
I made it faster by running a zcat on each core using parallel:
# Some lines omitted
for dir in ../../2016*/Sample_ESW_*${sample}_*
do
fastq=
for file in $dir/03-find-unmapped/*-unmapped.fastq.gz
do
fastq="$fastq $file"
done
ls $fastq | parallel "(zcat {} | fgrep -f read-ids -A 3 | egrep -v -e '^--$')" | gzip > run-$count.fastq.gz
done
But that's still inefficient because my main loop waits until each dir is completely processed (i.e., all its fastq files have been run). So as the last fastq files are being processed, cores are unused.
So, faster:
# Some lines omitted
for dir in ../../2016*/Sample_ESW_*${sample}_*
do
fastq=
for file in $dir/03-find-unmapped/*-unmapped.fastq.gz
do
fastq="$fastq $file"
done
echo "zcat $fastq | fgrep -f read-ids -A 3 | egrep -v -e '^--\$' | gzip > run-$count.fastq.gz"
done | parallel
This uses the fact that parallel will treat its input lines as commands to run (in parallel) if it's not given an explicit command to run.
And this could have been made faster by using parallel to run the zcat.
Counting sequences in FASTQ
I wanted to count the number of nucleotide sequences (billions of them) spread over nearly 6000 FASTQ files (found under directories that start with 20):
$ find 20* -maxdepth 2 -name '*.fastq.gz' | parallel --plus --bar "zcat {} | egrep -c '^\\+\$' > {..}.read-count"
The --bar gives a cool progress bar. The --plus makes {..} work (to remove two suffixes).
And there’s much more
- Breaking input up by delimiter instead of by line.
- Breaking up input into chunks and passing each chunk to a process that reads from stdin (using --pipe or --pipepart).
- Stop launching jobs after one (or a percentage) fail. E.g., parallel --halt now,fail=1.
- Kill currently running jobs if one fails.
- Resource limiting.
- Resuming failed jobs.
- Retrying failing commands.
- Dry run: --dry-run (see the sketch after this list).
- Using tmux to show output.
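As promised above, a quick look at --dry-run, which prints the commands parallel would run without running them (a handy sanity check before doing something destructive; the file names here are made up):
$ parallel --dry-run gzip {} ::: a.log b.log
gzip a.log
gzip b.log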
Installing
On OS X, if you’re using brew:
$ brew install parallel
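It's packaged on most Linux distributions too; on Debian/Ubuntu-style systems, for example:
$ sudo apt-get install parallel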
More info
- Run man parallel (see many examples at the bottom).
- The GNU parallel tutorial.
- The GNU parallel Wikipedia page.
- Some videos.
August 6th, 2017 at 8:19 am
Wow I’ve been ignoring `parallel` too. Small aside: you can set the `globstar` option with `shopt` to enable `**/*.c` in bash, so you don’t have to say things like `*/*/*.c`.
August 11th, 2017 at 4:12 pm
Thanks Jean – I didn’t know about globstar. I don’t use bash any more. I switched to fish a couple of years ago. Maybe fish has a similar thing. Thanks again.