A BLAST puzzle
Re-posted from my internal Cambridge notebook.
I have a strange situation. I have two almost identical BLAST databases (20170821
and 20170824
). They both contain a certain subject sequence LC074724
.
I query them both, using BLAST’s default search, with an identical query, Q
. In one case (20170824
) there is a very strong match but in the other (20170821
) the query is not matched at all.
I’m trying to figure out why.
BLAST version
$ blastn -version
blastn: 2.6.0+
Package: blast 2.6.0, build May 11 2017 22:22:40
The subject sequence is the same in both databases
Both databases contain an identical subject sequence:
$ blastdbcmd -entry LC074724 -db 20170824 | md5
fcfbc167c7f81dfd75aad1bf3db6220b
$ blastdbcmd -entry LC074724 -db 20170821 | md5
fcfbc167c7f81dfd75aad1bf3db6220b
Doing a match
There is a good match of Q
against LC074724
in 20170824
:
$ blastn -outfmt '6 qaccver saccver bitscore' -db 20170824 -query Q.fasta \
-task blastn -max_hsps 1 | grep LC074724
Q LC074724.1 2581
Note that the bitscore of the LC074724
match is 2581.
But there is no match of LC074724
at all in 20170821
:
$ blastn -outfmt '6 qaccver saccver bitscore' -db 20170821 -query Q.fasta \
-task blastn -max_hsps 1 | grep LC074724
The option -outfmt '6 qaccver saccver bitscore'
tells BLAST to print the query accession number and version, the subject accession and version, and the bitscore of the match.
What are the best hits for 20170821
So what are the bitscores of the Q
matches in 20170821
?
$ blastn -outfmt '6 bitscore' -db 20170821 -query Q.fasta -task blastn -max_hsps 1 | sort -nr | head
3301
2616
2590
2590
2563
2554
2547
2544
2540
2540
The bitscore of Q
against LC074724
in 20170824
(2581) would have placed it in 5th position in the top 10 matches of Q
in 20170821
, but LC074724
is not matched at all in that database! The worst reported match in 20170821
has bitscore 2354.
What’s the difference between the databases?
Get the ids of the subjects in each database
$ blastdbcmd -entry all -db 20170824 | egrep '^>' | cut -c2- | sort > 20170824.ids
$ blastdbcmd -entry all -db 20170821 | egrep '^>' | cut -c2- | sort > 20170821.ids
$ wc -l 2017082?.ids
9754 20170821.ids
9111 20170824.ids
18865 total
20170821
has 643 sequences that are not in 20170824
:
$ comm -23 20170821.ids 20170824.ids | wc -l
643
20170824
has no sequences that are not in 20170821
:
$ comm -13 20170821.ids 20170824.ids | wc -l
0
and the two have 9111 sequences in common:
$ comm -12 20170821.ids 20170824.ids | wc -l
9111
These numbers match with the numbers from wc -l
above.
So 20170821
is just 20170824
with an extra 643 sequences.
In other words, just adding some sequences to 20170824
to produce 20170821
makes one of the top hits completely vanish from the output!
Maybe LC074724 matches, but with a low bit score?
BLAST only shows the top 500 matched subjects by default. Let’s ask it to show more:
$ blastn -outfmt '6 qaccver saccver bitscore' -db 20170821 -query Q.fasta \
-task blastn -max_hsps 1 -max_target_seqs 10000 > 20170821-10000-matches.txt
And….
$ head 20170821-10000-matches.txt
Q Q 3301
Q X234A 2616
Q CS388973.1 2590
Q DM059402.1 2590
Q LC074724.1 2581
Q KJ843188.1 2576
Q AB697496.1 2576
Q AB697499.1 2576
Q AB697503.1 2576
Q AB697508.1 2576
Holy shit, there it is! And it’s in 5th position, just as it should be (based on the bitscores discussed above), and with exactly the same high score that it receives when matched in the other database (20170824
).
Just to confirm, taking away the -max_target_seqs
option gets no match:
$ blastn -outfmt '6 qaccver saccver bitscore' -db 20170821 -query Q.fasta \
-task blastn -max_hsps 1 | grep LC074724
$
HOW ON EARTH CAN THAT BE HAPPENING?
Three confusingly-named BLAST options
I’ve always found the names and descriptions of the following three BLAST command-line options very confusing.
From the Formatting options section of blastn -help
:
-num_descriptions <Integer, >=0>
Number of database sequences to show one-line descriptions for
Not applicable for outfmt > 4
Default = `500'
* Incompatible with: max_target_seqs
-num_alignments <Integer, >=0>
Number of database sequences to show alignments for
Default = `250'
* Incompatible with: max_target_seqs
And from the Restrict search or results section:
-max_target_seqs <Integer, >=1>
Maximum number of aligned sequences to keep
Not applicable for outfmt <= 4
Default = `500'
* Incompatible with: num_descriptions, num_alignments
From the section names, it seems like only the latter option (-max_target_seqs
) would have any effect on the search, and that the former two are just about what is displayed.
But… using either just -num_descriptions
or -num_alignments
with a big value also gets a match:
$ blastn -db 20170821 -query Q.fasta -task blastn -max_hsps 1 -num_descriptions 10000 | grep LC074724
LC074724.1 Hepatitis B virus DNA, complete genome, isolate: p621 2581 0.0
>LC074724.1 Hepatitis B virus DNA, complete genome, isolate: p621
$ blastn -db 20170821 -query Q.fasta -task blastn -max_hsps 1 -num_alignments 10000 | grep LC074724
LC074724.1 Hepatitis B virus DNA, complete genome, isolate: p621 2581 0.0
>LC074724.1 Hepatitis B virus DNA, complete genome, isolate: p621
So BLAST is apparently finding the LC074724
match even without the high -max_target_seqs
option, but it’s not displaying it unless there is a high -num_alignments
or -num_descriptions
value.
That makes no sense at all. BLAST is finding the match without the -max_target_seqs
option and the match has the 5th highest score, so why is it not being displayed?
I don’t understand the difference between -num_alignments
and -num_descriptions
.
Maybe -num_alignments
is per query and -num_descriptions
is in the overall result set (i.e., across all queries). BLAST seems to use “alignment” to refer to a match between a query and a single subject (and that match may have multiple high-scoring pairs (HSPs), each with its own bit score). But why is there a -num_descriptions
option? I can only think that BLAST is allocating memory to hold subject descriptions and this option is used to allocate some fixed storage for the overall search. So, possibly, if the -num_descriptions
limit is reached during the search, BLAST gives up because it knows it cannot store more subject descriptions. But only setting -num_alignments
also gets me a match, so maybe BLAST silently raises num_descriptions
to be at least as big as num_alignments
. I don’t know.
Just to confirm, when none of the 3 options are given and no -outfmt
option is either, the match is not found:
$ blastn -db 20170821 -query Q.fasta -task blastn -max_hsps 1 | grep LC074724
$
WTF?
I’ve been using BLAST for the last 4 years or so, and have only today realized that something like this could occur.
It’s of course super important.
And despite all the above, I still don’t understand what’s going on.
Here’s a gist that discusses what looks like the same issue. I pasted the above into it.
A blog post on bugs in BLAST and the difficulty of filing them, getting them fixed, etc.
See also this post in which the user adds the -max_target_seqs
option (setting it to 10
) and the the top hit vanishes. That’s the opposite to my case, where adding the option (with a high value) causes a previously very-high-bitscore-but-unmatched (or unshown) sequence to be matched (or shown).
You can follow any responses to this entry through the RSS 2.0 feed. Both comments and pings are currently closed.