Posted Friday, August 25th, 2017 at 10:25 pm under me.

A BLAST puzzle

Re-posted from my internal Cambridge notebook.

I have a strange situation. I have two almost identical BLAST databases (20170821 and 20170824). They both contain a certain subject sequence LC074724.

I query them both, using BLAST’s default search, with an identical query, Q. In one case (20170824) there is a very strong match but in the other (20170821) the query is not matched at all.

I’m trying to figure out why.

BLAST version

$ blastn -version
blastn: 2.6.0+
 Package: blast 2.6.0, build May 11 2017 22:22:40

The subject sequence is the same in both databases

Both databases contain an identical subject sequence:

$ blastdbcmd -entry LC074724 -db 20170824 | md5
fcfbc167c7f81dfd75aad1bf3db6220b

$ blastdbcmd -entry LC074724 -db 20170821 | md5
fcfbc167c7f81dfd75aad1bf3db6220b

Doing a match

There is a good match of Q against LC074724 in 20170824:

$ blastn -outfmt '6 qaccver saccver bitscore' -db 20170824 -query Q.fasta \
  -task blastn -max_hsps 1 | grep LC074724
Q   LC074724.1  2581

Note that the bitscore of the LC074724 match is 2581.

But there is no match of LC074724 at all in 20170821:

$ blastn -outfmt '6 qaccver saccver bitscore' -db 20170821 -query Q.fasta \
-task blastn -max_hsps 1 | grep LC074724

The option -outfmt '6 qaccver saccver bitscore' tells BLAST to print the query accession number and version, the subject accession and version, and the bitscore of the match.

What are the best hits for 20170821

So what are the bitscores of the Q matches in 20170821?

$ blastn -outfmt '6 bitscore' -db 20170821 -query Q.fasta -task blastn -max_hsps 1 | sort -nr | head
3301
2616
2590
2590
2563
2554
2547
2544
2540
2540

The bitscore of Q against LC074724 in 20170824 (2581) would have placed it in 5th position in the top 10 matches of Q in 20170821, but LC074724 is not matched at all in that database! The worst reported match in 20170821 has bitscore 2354.

What’s the difference between the databases?

Get the ids of the subjects in each database

$ blastdbcmd -entry all -db 20170824 | egrep '^>' | cut -c2- | sort > 20170824.ids
$ blastdbcmd -entry all -db 20170821 | egrep '^>' | cut -c2- | sort > 20170821.ids
$ wc -l 2017082?.ids
    9754 20170821.ids
    9111 20170824.ids
   18865 total

20170821 has 643 sequences that are not in 20170824:

$ comm -23 20170821.ids 20170824.ids | wc -l
     643

20170824 has no sequences that are not in 20170821:

$ comm -13 20170821.ids 20170824.ids | wc -l
       0

and the two have 9111 sequences in common:

$ comm -12 20170821.ids 20170824.ids | wc -l
    9111

These numbers match with the numbers from wc -l above.

So 20170821 is just 20170824 with an extra 643 sequences.

In other words, just adding some sequences to 20170824 to produce 20170821 makes one of the top hits completely vanish from the output!

Maybe LC074724 matches, but with a low bit score?

BLAST only shows the top 500 matched subjects by default. Let’s ask it to show more:

$ blastn -outfmt '6 qaccver saccver bitscore' -db 20170821 -query Q.fasta \
  -task blastn -max_hsps 1 -max_target_seqs 10000 > 20170821-10000-matches.txt

And….

$ head 20170821-10000-matches.txt
Q   Q   3301
Q   X234A   2616
Q   CS388973.1  2590
Q   DM059402.1  2590
Q   LC074724.1  2581
Q   KJ843188.1  2576
Q   AB697496.1  2576
Q   AB697499.1  2576
Q   AB697503.1  2576
Q   AB697508.1  2576

Holy shit, there it is! And it’s in 5th position, just as it should be (based on the bitscores discussed above), and with exactly the same high score that it receives when matched in the other database (20170824).

Just to confirm, taking away the -max_target_seqs option gets no match:

$ blastn -outfmt '6 qaccver saccver bitscore' -db 20170821 -query Q.fasta \
  -task blastn -max_hsps 1 | grep LC074724
$

HOW ON EARTH CAN THAT BE HAPPENING?

Three confusingly-named BLAST options

I’ve always found the names and descriptions of the following three BLAST command-line options very confusing.

From the Formatting options section of blastn -help:

-num_descriptions <Integer, >=0>
  Number of database sequences to show one-line descriptions for
  Not applicable for outfmt > 4
  Default = `500'
   * Incompatible with:  max_target_seqs

-num_alignments <Integer, >=0>
  Number of database sequences to show alignments for
  Default = `250'
   * Incompatible with:  max_target_seqs

And from the Restrict search or results section:

-max_target_seqs <Integer, >=1>
  Maximum number of aligned sequences to keep
  Not applicable for outfmt <= 4
  Default = `500'
   * Incompatible with:  num_descriptions, num_alignments

From the section names, it seems like only the latter option (-max_target_seqs) would have any effect on the search, and that the former two are just about what is displayed.

But… using either just -num_descriptions or -num_alignments with a big value also gets a match:

$ blastn -db 20170821 -query Q.fasta -task blastn -max_hsps 1 -num_descriptions 10000 | grep LC074724
LC074724.1  Hepatitis B virus DNA, complete genome, isolate: p621     2581    0.0
>LC074724.1 Hepatitis B virus DNA, complete genome, isolate: p621
$ blastn -db 20170821 -query Q.fasta -task blastn -max_hsps 1 -num_alignments 10000 | grep LC074724
LC074724.1  Hepatitis B virus DNA, complete genome, isolate: p621     2581    0.0
>LC074724.1 Hepatitis B virus DNA, complete genome, isolate: p621

So BLAST is apparently finding the LC074724 match even without the high -max_target_seqs option, but it’s not displaying it unless there is a high -num_alignments or -num_descriptions value.

That makes no sense at all. BLAST is finding the match without the -max_target_seqs option and the match has the 5th highest score, so why is it not being displayed?

I don’t understand the difference between -num_alignments and -num_descriptions.

Maybe -num_alignments is per query and -num_descriptions is in the overall result set (i.e., across all queries). BLAST seems to use “alignment” to refer to a match between a query and a single subject (and that match may have multiple high-scoring pairs (HSPs), each with its own bit score). But why is there a -num_descriptions option? I can only think that BLAST is allocating memory to hold subject descriptions and this option is used to allocate some fixed storage for the overall search. So, possibly, if the -num_descriptions limit is reached during the search, BLAST gives up because it knows it cannot store more subject descriptions. But only setting -num_alignments also gets me a match, so maybe BLAST silently raises num_descriptions to be at least as big as num_alignments. I don’t know.

Just to confirm, when none of the 3 options are given and no -outfmt
option is either, the match is not found:

$ blastn -db 20170821 -query Q.fasta -task blastn -max_hsps 1 | grep LC074724
$

WTF?

I’ve been using BLAST for the last 4 years or so, and have only today realized that something like this could occur.

It’s of course super important.

And despite all the above, I still don’t understand what’s going on.

Here’s a gist that discusses what looks like the same issue. I pasted the above into it.

A blog post on bugs in BLAST and the difficulty of filing them, getting them fixed, etc.

See also this post in which the user adds the -max_target_seqs option (setting it to 10) and the the top hit vanishes. That’s the opposite to my case, where adding the option (with a high value) causes a previously very-high-bitscore-but-unmatched (or unshown) sequence to be matched (or shown).

Comments are closed.