uniprot – Samposium – The Exciting Adventures of Sam (https://samnicholls.net)

`rapsearch` Returns
https://samnicholls.net/2015/06/01/rapsearch-returns/
Mon, 01 Jun 2015 11:00:42 +0000

Following completion of my most recent side-quest to find out a little more about who the protozoa actually are and where they live in the context of UniProt, I now had a starting point to append to my archive of hydrolase records. I had already shown that around 1,500 Ciliophora-associated hydrolases could be extracted from UniProt, but before continuing my hunt for relevant protozoans, I wanted to run a quick sanity check.

As of May 2015, querying for all records in UniProt (i.e. either the manually-curated SwissProt or the automatically-annotated TrEMBL) which are assigned an EC (Enzyme Commission) number of 3.* yields just over 1.4M results. In a recent introduction to my newly extracted databases, my bacteria-associated hydrolases alone totalled (reviewed + unreviewed) 1.1M records — so we’ve already pulled out almost 80% of all the hydrolases in UniProt!

I mentioned previously that we had started using rapsearch as an alternative to BLAST due to its execution speed. In fact rapsearch was capable of searching through all ~700K limpet contig sequences (totalling ~433 megabases) against the largest of the hydrolase databases I had created (~1.1M sequences, ~408 megabases) without the need to shard either the contigs, or the database — in a matter of hours as opposed to weeks (or as it has felt, forever).

The primary reason for extracting only particular taxa-associated records was to reduce cluster time, but clearly we’re able to adequately process the vast majority of records in a more than reasonable time. There doesn’t appear to be a reason against just making a superdatabase of all the hydrolases in UniProt; a “few” extra sequences seem somewhat moot in terms of computational cost…

Thus I tabulate below the results of executing rapsearch over the limpet contigs for all hydrolases in both SwissProt and TrEMBL:

| Database Source | #Records | #Nucleotides | Execution Time (Max. GB RAM) | Raw Hits | Bitscore Filter | Overlap Filter |
|---|---|---|---|---|---|---|
| SwissProt | 64,521 | 26,516,938 | 0:35:03 (95.37) | 604,867 | 224,394 (37.10%) | 13,706 (6.11%, Raw: 2.27%) |
| TrEMBL | 1,335,692 | 545,226,198 | 3:39:30 (129.32) | 1,975,908 | 979,220 (49.56%) | 33,599 (3.43%, Raw: 1.70%) |
| Total | 1,400,213 | 571,743,136 | 4:14:33 (224.69) | 2,580,775 | 1,203,614 (46.64%) | 35,756 (2.98%, Raw: 1.39%) |
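As a quick check, the filter percentages in the table fall straight out of the raw counts (the numbers below come directly from the table):

```python
# Recompute the filter percentages in the table from the raw counts.
rows = {
    "SwissProt": (604_867, 224_394, 13_706),
    "TrEMBL": (1_975_908, 979_220, 33_599),
}
for name, (raw, bitscore, overlap) in rows.items():
    print(f"{name}: bitscore {100 * bitscore / raw:.2f}%, "
          f"overlap {100 * overlap / bitscore:.2f}% "
          f"(raw: {100 * overlap / raw:.2f}%)")
```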
Raiding `rapsearch` Results
https://samnicholls.net/2015/05/09/raiding-rapsearch-results/
Sat, 09 May 2015 11:00:56 +0000

Finally. After all the trouble I’ve had trying to scale BLAST, running out of disk space, database accounting irregularities and investigating an `archive_exception`, we have data.

Thanks to the incredible speed of rapsearch, what I’ve been trying to accomplish over the past few months with BLAST has been done in mere hours without the hassle of database or contig sharding. Quantifying the accuracy of rapsearch is still something our team is working on, but Tom’s initial results suggest comparable performance to BLAST. For the time being at least, it means I can get things done.

As previously described, I extracted bacterial, archaeal and fungal associated hydrolases from both the SwissProt (manually curated) and TrEMBL (automatically annotated) databases. The tables below summarise the number of “raw” hits from rapsearch, the number of hits remaining after discarding those with a bitscore of less than 40¹, followed by the hits remaining after selecting the “best” hit in cases where hits overlap by 100bp² or more.
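The two passes can be sketched as follows. This is a simplified illustration under an assumed hit layout of `(contig, start, end, bitscore)`, not MGKit’s actual implementation:

```python
# Simplified sketch of the two filtering passes (not MGKit's real code).
# Each hit is assumed to be (contig, start, end, bitscore).
def bitscore_filter(hits, threshold=40):
    """Discard hits with a bitscore below the threshold."""
    return [h for h in hits if h[3] >= threshold]

def overlap_filter(hits, min_overlap=100):
    """Among hits overlapping by >= min_overlap bp, keep only the best."""
    kept = []
    for hit in sorted(hits, key=lambda h: -h[3]):  # best bitscore first
        contig, start, end, _ = hit
        clashes = any(
            k[0] == contig and min(end, k[2]) - max(start, k[1]) >= min_overlap
            for k in kept
        )
        if not clashes:
            kept.append(hit)
    return kept

hits = [
    ("c1", 0, 500, 120),    # kept: best hit in its region
    ("c1", 50, 550, 80),    # dropped: overlaps the 120-bit hit by 450bp
    ("c1", 900, 1200, 45),  # kept: no 100bp overlap with anything kept
    ("c2", 0, 300, 20),     # dropped by the bitscore filter
]
survivors = overlap_filter(bitscore_filter(hits))
```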

SwissProt

| Taxa | Raw SP | Bitscore Filter | Overlap Filter |
|---|---|---|---|
| Bacteria [2] | 127,403 | 58,715 | 1,649 |
| Archaea [2157] | 17,141 | 2,326 | 341 |
| Fungi [4751] | 78,083 | 34,180 | 3,222 |
| Total | 222,627 | 95,221 (42.77%) | 5,212 (5.47%, Raw: 2.34%) |

TrEMBL

| Taxa | Raw TR | Bitscore Filter | Overlap Filter |
|---|---|---|---|
| Bacteria [2] | 683,307 | 392,791 | 6,810 |
| Archaea [2157] | 79,738 | 34,950 | 1,486 |
| Fungi [4751] | 345,160 | 190,936 | 7,379 |
| Total | 1,108,205 | 618,677 (55.83%) | 15,675 (2.53%, Raw: 1.41%) |

Merged

| Taxa | Raw All | Bitscore Filter | Overlap Filter |
|---|---|---|---|
| All | 1,330,832 | 713,898 (53.64%) | 12,194 (1.71%, Raw: 0.92%) |

Initially, I had merged all the hits from both databases and all three taxa together to create a super-hit list, yielding just over 12k reasonable-quality (bitscore >= 40) hits to play with after both filtering steps. However, I became concerned with what I’ll coin overlap loss: a significant number of hits were discarded in overlapping regions. 94.53% and 97.47% of the bitscore-filtered hits were lost to overlap for SwissProt and TrEMBL respectively!
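Those overlap-loss percentages fall straight out of the per-database tables above:

```python
# Proportion of bitscore-filtered hits discarded by the overlap filter.
sp_loss = 1 - 5_212 / 95_221    # SwissProt: kept / bitscore-filtered
tr_loss = 1 - 15_675 / 618_677  # TrEMBL
print(f"SwissProt overlap loss: {100 * sp_loss:.2f}%")
print(f"TrEMBL overlap loss: {100 * tr_loss:.2f}%")
```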

I suspect that due to the condensed nature of the assembly (around 16.7 billion base pairs of raw reads aligning to an assembly of just 433 million base pairs, an average coverage of ~38x), there is likely to be a lot of overlap occurring in 100bp windows. The question is: how well does the top-ranking hit represent a hydrolase? And how is “top-ranking” defined? Let’s read the manual³:

[…] preference is given to the db quality first, than [sic] the bit score and finally the lenght [sic] of annotation, the one with the highest values is kept

Hmm, it sounds as though database quality is considered paramount by the default ranking, ensuring that SwissProt results take precedence over those from TrEMBL. But what happens if the SwissProt hit actually has a lower bitscore? To check, I’ll modify the source⁴ slightly to construct a trivial example below⁵.

```python
# Default discarding function (from the filter-gff source):
#   lambda a1, a2: min(a1, a2, key=lambda el: (el.dbq, el.bitscore, len(el)))

# I'll treat a 3-element list as the `el` object, with format:
#   hit = [dbq, bitscore, length]
# Let's re-define the default discarding function to anticipate this list:
discard = lambda a1, a2: min(a1, a2, key=lambda el: (el[0], el[1], el[2]))

# Construct some hits
crap_sp_hit = [10, 39, 100]  # SwissProt: high dbq, low bitscore
good_tr_hit = [8, 60, 100]   # TrEMBL: lower dbq, higher bitscore

# Determine which hit to discard
discard(crap_sp_hit, good_tr_hit)
# > [8, 60, 100]
```

Welp, the “better” TrEMBL-originating annotation is discarded. It seems the default selection function values database quality above all else. Ideally we’d like a metric that gives weight to sequences held in SwissProt (to reflect their curation accuracy), but not so much that they are always chosen over better hits from a “weaker” database. filter-gff does accept an optional --choose-func parameter, whose behaviour I will now be investigating.
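One possible alternative, purely a hypothetical weighting of my own rather than anything from the MGKit documentation, would be to convert database quality into a modest bitscore bonus instead of an absolute ranking:

```python
# Hypothetical discard function: dbq becomes a small bitscore bonus
# rather than the primary sort key. The bonus value is an assumption
# and would need tuning. hit = [dbq, bitscore, length], as before;
# the function returns the hit to DISCARD.
SP_DBQ, DBQ_BONUS = 10, 5

discard = lambda a1, a2: min(
    a1, a2,
    key=lambda el: (el[1] + (DBQ_BONUS if el[0] == SP_DBQ else 0), el[2]),
)

crap_sp_hit = [10, 39, 100]
good_tr_hit = [8, 60, 100]
discard(crap_sp_hit, good_tr_hit)
# Now the low-scoring SwissProt hit is the one discarded.
```

On equal bitscores the SwissProt hit still wins (its effective score is higher by the bonus), which is the behaviour we actually wanted.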

As an aside, it’s interesting to note the very high number of hits for fungal-associated hydrolases, especially after overlap filtering, where for both SwissProt and TrEMBL they outnumber the bacterial and archaeal hits. I wonder whether this is indicative of the host contamination known to be in the data set; I’d guess host-associated sequences are more likely to have hits in the fungal database, as both are at least eukaryotic.

Before moving on, I’ll re-run the overlap filtering step with a less naive filtering function and report back. I’m curious to see what else the overlap loss is discarding: are the overlapping hits similar in function and taxonomy, or widely different?

To really consider whether or not a hit is representative of a hydrolase, we need to calculate how much of the whole database target sequence the hit covers. It’s all well and good to have a high-bitscoring hit to a hydrolase, but if it only covers a fraction of the whole sequence, that doesn’t necessarily bode well for a “real” hydrolase being on the contig. Unfortunately, the m8/blast6 output format (as produced by rapsearch) does not give the length of the target sequence, electing to give only the length of the aligned region and its identity.

So the next step will be to index the FASTA files used to build the hydrolase databases, then for each hit found by rapsearch: query the FASTA indices for the length of the target hydrolase and work out the proportion of the target covered by the hit. I can then add these values to the GFF files (containing the hits) and refer to them in my own hit discarding function when re-calling filter-gff. Easy.
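A minimal sketch of that plan in plain Python, assuming the standard blast6 column order (target ID in column 2, target start/end in columns 9 and 10) and no external indexing tools:

```python
# Index target sequence lengths from the FASTA used to build the
# database, then compute the proportion of each target a hit spans.
def fasta_lengths(path):
    """Map each FASTA record ID to its sequence length."""
    lengths, name = {}, None
    with open(path) as fh:
        for line in fh:
            line = line.rstrip()
            if line.startswith(">"):
                name = line[1:].split()[0]
                lengths[name] = 0
            elif name:
                lengths[name] += len(line)
    return lengths

def target_coverage(m8_line, lengths):
    """Proportion of the database target covered by one m8 hit line."""
    fields = m8_line.split("\t")
    target = fields[1]                            # subject (target) ID
    t_start, t_end = int(fields[8]), int(fields[9])
    return (abs(t_end - t_start) + 1) / lengths[target]
```

In practice the coverage values would then be written into the GFF attributes so a custom discard function can see them.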

I’m somewhat confused, though: the proportion of a database sequence covered by a hit seems like a common question when determining how “good” a hit really is. I spoke to Tom, who normally defines a good hit by its bitscore and by the number of bases on the hit that actually matched the target sequence exactly (hit length * hit identity); he tells me this is pretty common too.
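Tom’s metric needs only two columns that m8 output does provide, so it can be computed directly; a trivial sketch:

```python
# Approximate count of exactly matching bases in a hit:
# the alignment length scaled by the percent-identity column.
def matched_bases(aln_length, percent_identity):
    return aln_length * percent_identity / 100

matched_bases(300, 95.0)  # -> 285.0
```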

I guess we want to ensure we’re discovering “real” hydrolases by picking out annotations that cover as much of a known hydrolase gene as possible. But it still seems strange to me that this sort of metric isn’t more commonly used, are we doing something special?


tl;dr

  • We have data. Already it raises more questions than answers.
  • MGKit’s filter-gff annotation overlap discarder has a superiority complex causing it to liberally discard annotations from a “weaker” database by default.
  • You never quite get what you need from a program’s output.

  1. This seems to be a fairly commonly used “cut-off” threshold when looking at hit quality. 
  2. Using filter-gff
  3. This is a good reflex reaction. 
  4. I did panic briefly when I saw the function minimises rather than maximises quality; only to read the documentation further and discover the function must return the annotation to be discarded, rather than selected. Phew. 
  5. Our team uses a database quality score of 8 for TrEMBL and 10 for SwissProt annotations. The numbers don’t particularly matter, so long as SP > TR.