Following completion of my most recent side-quest to find out a little more about who the protozoa actually are and where they live in the context of UniProt, I now had a starting point to append to my archive of hydrolase records. I had already shown that around 1,500 Ciliophora-associated hydrolases could be extracted from UniProt, but before continuing my hunt for relevant protozoans, I wanted to run a quick sanity check.

As of May 2015, querying for all records in UniProt (i.e. either the manually-curated SwissProt or automatically-annotated TrEMBL) which are assigned an EC (Enzyme Commission) number of 3.* yields just over 1.4M results. Yet in a recent introduction to my newly extracted databases, my bacterial-associated hydrolases totalled (reviewed + unreviewed) 1.1M records, so we’ve pulled out almost 80% (1.1M of 1.4M) of all the hydrolases in UniProt already!

I mentioned previously that we had started using rapsearch as an alternative to BLAST due to its execution speed. In fact rapsearch was capable of searching all ~700K limpet contig sequences (totalling ~433 megabases) against the largest of the hydrolase databases I had created (~1.1M sequences, ~408 megabases) without the need to shard either the contigs or the database, and in a matter of hours as opposed to weeks (or as it has felt, forever).

The primary reason for taking only particular taxa-associated records was to reduce cluster time, but clearly we’re able to adequately process the vast majority of records in a more than reasonable time. There therefore doesn’t appear to be a reason against just making a superdatabase of all the hydrolases in UniProt; the cost of a “few” extra sequences seems somewhat moot in terms of computational complexity…

Thus I tabulate below the results of executing rapsearch over the limpet contigs for all hydrolases in both SwissProt and TrEMBL:

| Database Source | #Records | #Nucleotides | Execution Time (Max. GB RAM) | Raw Hits | Bitscore Filter | Overlap Filter |
|---|---|---|---|---|---|---|
| SwissProt | 64,521 | 26,516,938 | 0:35:03 (95.37) | 604,867 | 224,394 (37.10%) | 13,706 (6.11%, Raw:2.27%) |
| TrEMBL | 1,335,692 | 545,226,198 | 3:39:30 (129.32) | 1,975,908 | 979,220 (49.56%) | 33,599 (3.43%, Raw:1.70%) |
| Total | 1,400,213 | 571,743,136 | 4:14:33 (224.69) | 2,580,775 | 1,203,614 (46.64%) | 35,756 (2.98%, Raw:1.39%) |
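
For anyone wanting to check the percentages, here is a minimal sketch of how they can be recomputed from the hit counts. It assumes (going by the numbers rather than anything stated explicitly) that the bitscore percentage is relative to the raw hits, that the first overlap percentage is relative to the bitscore-filtered hits, and that "Raw" is relative to the raw hits. The names `pct` and `retention` are just illustrative helpers, not part of any pipeline described here.

```python
# Hit counts copied from the table: (raw hits, after bitscore filter, after overlap filter)
counts = {
    "SwissProt": (604_867, 224_394, 13_706),
    "TrEMBL": (1_975_908, 979_220, 33_599),
}


def pct(part, whole):
    """Return part as a percentage of whole."""
    return 100.0 * part / whole


def retention(raw, bitscore, overlap):
    """Percentage of hits surviving each filter (assumed denominators, see above)."""
    return {
        "bitscore_of_raw": pct(bitscore, raw),
        "overlap_of_bitscore": pct(overlap, bitscore),
        "overlap_of_raw": pct(overlap, raw),
    }


for name, row in counts.items():
    r = retention(*row)
    print(
        f"{name}: bitscore {r['bitscore_of_raw']:.2f}%, "
        f"overlap {r['overlap_of_bitscore']:.2f}% (Raw: {r['overlap_of_raw']:.2f}%)"
    )
```

Recomputed values may differ from the table in the last decimal place because of rounding.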