Having completed my most recent side-quest to find out a little more about who the protozoa actually are and where they live in the context of UniProt, I now had a starting point for adding to my archive of hydrolase records. I had already shown that around 1,500 Ciliophora-associated hydrolases could be extracted from UniProt, but before continuing my hunt for relevant protozoans, I wanted to run a quick sanity check.
As of May 2015, querying UniProt (i.e. both the manually curated SwissProt and the automatically annotated TrEMBL) for all records assigned an EC (Enzyme Commission) number of 3.* yields just over 1.4M results. Yet in a recent introduction to my newly extracted databases, my bacterial-associated hydrolases totalled 1.1M records (reviewed + unreviewed), so we’ve pulled out almost 80% of all the hydrolases in UniProt already!
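That proportion is easy to check. The sketch below uses only the rounded figures quoted above, so the variable names are my own and the exact counts will differ slightly:

```python
# Rounded record counts quoted in the text (approximations, not exact query results).
all_hydrolases = 1_400_000        # UniProt records with an EC number of 3.*
bacterial_hydrolases = 1_100_000  # reviewed + unreviewed bacterial-associated records

# Fraction of all UniProt hydrolases already covered by the bacterial database.
coverage = bacterial_hydrolases / all_hydrolases
# Rough number of hydrolase records that have not been extracted yet.
remaining = all_hydrolases - bacterial_hydrolases

print(f"Already extracted: {coverage:.1%}")
print(f"Not yet extracted: ~{remaining:,} records")
```

On these rounded numbers the coverage comes out a little under 80%, leaving roughly 300,000 records outside the bacterial set.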
I mentioned previously that we had started using rapsearch as an alternative to BLAST due to its execution speed. In fact, rapsearch was capable of searching all ~700K limpet contig sequences (totalling ~433 megabases) against the largest of the hydrolase databases I had created (~1.1M sequences, ~408 megabases) without the need to shard either the contigs or the database, and in a matter of hours as opposed to weeks (or, as it has felt, forever).
The primary reason for taking only particular taxa-associated records was to reduce cluster time, but clearly we’re able to process the vast majority of records in a more than reasonable time. There doesn’t appear to be a reason against just making a superdatabase of all the hydrolases in UniProt; a “few” extra sequences seems somewhat moot in terms of computational complexity…
Thus I tabulate below the results of executing rapsearch over the limpet contigs for all hydrolases in both SwissProt and TrEMBL:
| Database Source | # Records | # Nucleotides | Execution Time, h:mm:ss (Max. RAM, GB) | Raw Hits | Bitscore Filter (% of raw) | Overlap Filter (% of bitscore-filtered, % of raw) |
|---|---|---|---|---|---|---|
| SwissProt | 64,521 | 26,516,938 | 0:35:03 (95.37) | 604,867 | 224,394 (37.10%) | 13,706 (6.11%, Raw: 2.27%) |
| TrEMBL | 1,335,692 | 545,226,198 | 3:39:30 (129.32) | 1,975,908 | 979,220 (49.56%) | 33,599 (3.43%, Raw: 1.70%) |
| Total | 1,400,213 | 571,743,136 | 4:14:33 (224.69) | 2,580,775 | 1,203,614 (46.64%) | 35,756 (2.98%, Raw: 1.39%) |
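To make the percentage columns easier to read, here is a minimal sketch of how I take them to be derived: the bitscore percentage is relative to the raw hits, and the two overlap percentages are relative to the bitscore-filtered hits and the raw hits respectively. The figures are copied from the table; the dictionary layout and the `percent` helper are purely illustrative and not part of the actual pipeline.

```python
# Figures copied from the table above; the dictionary layout is only for illustration.
rows = {
    "SwissProt": {"records": 64_521, "raw": 604_867, "bitscore": 224_394, "overlap": 13_706},
    "TrEMBL": {"records": 1_335_692, "raw": 1_975_908, "bitscore": 979_220, "overlap": 33_599},
}


def percent(part, whole):
    """Return part as a percentage of whole, rounded to two decimal places."""
    return round(100 * part / whole, 2)


for name, row in rows.items():
    kept_bitscore = percent(row["bitscore"], row["raw"])     # survivors of the bitscore filter
    kept_overlap = percent(row["overlap"], row["bitscore"])  # survivors of the overlap filter
    kept_of_raw = percent(row["overlap"], row["raw"])        # same hits, relative to raw hits
    print(f"{name}: bitscore {kept_bitscore}%, overlap {kept_overlap}% (raw {kept_of_raw}%)")

# Records, raw hits and bitscore-filtered hits are additive across the two sources.
total_records = sum(row["records"] for row in rows.values())
total_raw = sum(row["raw"] for row in rows.values())
total_bitscore = sum(row["bitscore"] for row in rows.values())
print(f"Total: {total_records:,} records, {total_raw:,} raw hits, {total_bitscore:,} after bitscore filter")
```

Note that the overlap-filtered total in the table (35,756) is not simply the sum of the two rows, so it is not recomputed here.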