Mysteries – Samposium https://samnicholls.net The Exciting Adventures of Sam Mon, 10 Jan 2022 13:17:22 +0000

Duplicati: not a valid Win32 FileTime https://samnicholls.net/2022/01/10/duplicati-not-a-valid-win32-filetime/ Mon, 10 Jan 2022 12:47:12 +0000

I diagnosed this by using the “Live” log, starting the broken backup again and observing that the error is thrown after the Backend event: List log entry. I’m using Samba to back up parts of my Windows machine to a server in my house. From the server I listed the target backup directory with ls -lutr (long listing, using access time, sorted by time, in reverse order) and immediately noticed that several blocks from the last successful backup had an access time in the year 30828! I touched the affected files to give them a more sensible access time and the backup from my Windows machine was able to complete successfully. If I ever discover how this happened in the first place, I will update this post. In the meantime, I hope this is helpful to someone else, as searching for help on the issue only raised old problems with local file backups.

Quick fix for Crashplan Linux segfault https://samnicholls.net/2019/05/20/quick-fix-for-crashplan-linux-segfault/ Mon, 20 May 2019 08:32:44 +0000

If you use Crashplan as your backup solution, it’s likely because it is one of the only companies that make backing up from Linux straightforward (even if they did shaft all their home customers and force them onto business accounts).
However, part of the Linux Crashplan experience appears to be discovering, from time to time, that the backup engine has stopped working and the application segfaults on start. The segfault is caused by updating glibc, which as a good administrator you are probably doing by occasionally running an update with your package manager du jour. The update borks a standard compiled library that the Crashplan Electron app needs, namely libnode.so. It was a pain to work this out because the client’s logging is pretty rubbish. Luckily, the fix turns out to be pretty simple. Just install an Electron app that isn’t a pile of garbage, like GitHub’s Atom editor, then copy its libnode.so to your Crashplan directory:
cd /usr/local/crashplan/electron
sudo mv libnode.so libnode.so.bork
sudo cp /usr/share/atom/libnode.so .
Of course, your locations may vary, but this will allow the application to start.

Grokking GATK: Common Pitfalls with the Genome Analysis Tool Kit (and Picard) https://samnicholls.net/2015/11/11/grokking-gatk/ Wed, 11 Nov 2015 16:11:50 +0000

The Genome Analysis Tool Kit (“the” GATK) is a big part of our pipeline here. Recently I’ve been following the DNASeq Best Practice Pipeline for my limpet sequence data. Here are some of the mistakes I made and how I made them go away.

Input file extension pedanticism

Invalid command line: The GATK reads argument (-I, --input_file) supports only BAM/CRAM files with the .bam/.cram extension

Starting small, this was a simple oversight on my part. My naming script had made a mistake, but I knew the files were BAM, so I ignored the issue and continued with the pipeline anyway. GATK, however, was not impressed and aborted immediately. A minor annoyance (the error even acknowledges the input appears to be BAM) but a trivial fix.

A sequence dictionary (and index) is compulsory for use of a FASTA reference

Fasta dict file <ref>.dict for reference <ref>.fa does not exist. Please see http://gatkforums.broadinstitute.org/discussion/1601/how-can-i-prepare-a-fasta-file-to-use-as-reference for help creating it.

Unmentioned in the documentation for the RealignerTargetCreator tool I was using, a sequence dictionary for the reference FASTA must be built and present in the same directory. The error kindly refers you to a help article on how one can achieve this with Picard and indeed, the process is simple:

java -jar ~/git/picard-tools-1.138/picard.jar CreateSequenceDictionary R=<ref>.fa O=<ref>.dict

Though, I am somewhat confused as to exactly what a .dict file provides GATK over a FASTA index .fai (which is also required). Both files include the name and length of each contig in the reference, but the FASTA index also includes the positional information (byte offsets and line lengths) vital to enabling fast random access. The only additional information in the SAM-header-like sequence dictionary appears to be an MD5 hash of the sequence which doesn’t seem overly useful in this scenario. I guess the .dict adds a layer of protection if GATK uses the hash as a sanity check, ensuring the loaded reference matches the one for which the index and dictionary were constructed.
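
To see the overlap between the two files for yourself, here is a minimal Python sketch that reads both and reports any contig whose presence or length disagrees. It assumes the usual layouts: a tab-separated .fai whose first two columns are contig name and length, and a .dict made of SAM-style @SQ lines carrying SN, LN and (optionally) M5 tags. The function names are my own inventions for illustration, not part of GATK or Picard.

```python
from pathlib import Path


def read_fai(path):
    """Return {contig: length} from a .fai FASTA index."""
    contigs = {}
    for line in Path(path).read_text().splitlines():
        if not line.strip():
            continue
        # Columns: name, length, byte offset, bases per line, bytes per line.
        name, length = line.split("\t")[:2]
        contigs[name] = int(length)
    return contigs


def read_dict(path):
    """Return {contig: (length, md5 or None)} from a .dict sequence dictionary."""
    contigs = {}
    for line in Path(path).read_text().splitlines():
        if not line.startswith("@SQ"):
            continue
        # Each field is TAG:value; split on the first colon only, because
        # values such as UR:file:/path contain further colons.
        fields = dict(f.split(":", 1) for f in line.split("\t")[1:])
        contigs[fields["SN"]] = (int(fields["LN"]), fields.get("M5"))
    return contigs


def compare_index_and_dict(fai_path, dict_path):
    """List (contig, fai_length, dict_length) for every disagreement."""
    fai = read_fai(fai_path)
    seq_dict = read_dict(dict_path)
    problems = []
    for name in sorted(set(fai) | set(seq_dict)):
        fai_len = fai.get(name)
        dict_len = seq_dict.get(name, (None, None))[0]
        if fai_len != dict_len:
            problems.append((name, fai_len, dict_len))
    return problems
```

An empty list from compare_index_and_dict means the two files agree on contig names and lengths; it says nothing about whether the MD5 hashes still match the FASTA itself.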

You forgot to index your intermediate BAM

Invalid command line: Cannot process the provided BAM file(s) because they were not indexed. The GATK does offer limited processing of unindexed BAMs in --unsafe mode, but this GATK feature is currently unsupported.

Another frequently occurring issue caused by user forgetfulness. Following the best practice pipeline, one generates many “intermediate” BAMs. Each of these must be indexed for efficient use during the following step, otherwise GATK will be disappointed with your lack of attention to detail and refuse to do any work for you.

Edit (13 Nov): A helpful reddit comment from a Picard contributor recommended setting CREATE_INDEX=True when using Picard, so that an index of your newly output BAM is created automatically. Handy!
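
If you would rather catch a forgotten index before GATK does, a quick check along these lines might help. This is only a sketch: it assumes an index sits next to its BAM and is named either sample.bam.bai or sample.bai, and find_unindexed_bams is a hypothetical helper name of my own.

```python
from pathlib import Path


def find_unindexed_bams(directory):
    """Return the BAM files in `directory` with no index file beside them."""
    missing = []
    for bam in sorted(Path(directory).glob("*.bam")):
        # Assumed naming conventions: sample.bam.bai or sample.bai.
        candidates = [Path(str(bam) + ".bai"), bam.with_suffix(".bai")]
        if not any(candidate.exists() for candidate in candidates):
            missing.append(bam)
    return missing
```

Running it over the directory of intermediate BAMs between pipeline steps gives a list of files that still need indexing.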

Your temporary directory is probably too small

Unable to create a temporary BAM schedule file. Please make sure Java can write to the default temp directory or use -Djava.io.tmpdir= to instruct it to use a different temp directory instead.

GATK appears to love creating hundreds of thousands of small bamschedule.* files which, according to a glance at some relevant-looking GATK source, appear to be used for multithreaded merging of large BAM files. So numerous are these files that their presence exhausted my limited temporary space. This was especially frustrating given the job had run for several hours blissfully unaware that there are only so many things you can store in a shoebox. To avoid such disaster, inform Java of a more suitable location to store junk:

java -Djava.io.tmpdir=/not/a/shoebox/ -jar <jar> <tool> ...

On rare occasions, you may encounter permission errors when writing to a temporary directory. Specifying java.io.tmpdir as above will free you of these woes too.
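
As a rough illustration, the choice of temporary directory can be checked before a long job is submitted. The sketch below picks the first writable candidate with enough free bytes and builds the java argument list; pick_tmpdir and java_command are hypothetical names, and the check only looks at free bytes, so a filesystem that runs out of inodes first would still slip through.

```python
import os
import shutil


def pick_tmpdir(candidates, min_free_bytes):
    """Return the first writable directory with enough free space, else None."""
    for path in candidates:
        try:
            free = shutil.disk_usage(path).free
        except OSError:
            # Directory is missing or unreadable; try the next one.
            continue
        if free >= min_free_bytes and os.access(path, os.W_OK):
            return path
    return None


def java_command(jar, tool, tmpdir):
    """Build the argument list for running a tool with a custom temp dir."""
    return ["java", "-Djava.io.tmpdir=" + tmpdir, "-jar", jar, tool]
```

The returned list can be handed to subprocess.run once the remaining tool arguments have been appended.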

You may have too many files and not enough file handles

Picard and GATK try to store some number of reads (or other plentiful metadata) in RAM during the parsing and handling of BAM files. When this limit is exceeded, reads are spilled to disk. Both Picard and GATK appear to keep file handles for these spilled reads open simultaneously, presumably for fast access. But your executing user is likely limited to carrying only so many handles before becoming over encumbered, falling to the ground with throwing an exception being the only option:

Exception in thread "main" htsjdk.samtools.SAMException: […].tmp not found
[…]
Caused by: java.io.FileNotFoundException: […].tmp (Too many open files)

In my case, I encountered this error when using Picard MarkDuplicates, which has a default maximum number of file handles1. This ceiling happened to be higher than that of the system itself. The fix in this case is trivial: use ulimit -n to determine the number of files your system will permit you to have a handle on at once and inform MarkDuplicates using the MAX_FILE_HANDLES_FOR_READ_ENDS_MAP parameter:

$ ulimit -n
1024

$ java -jar picard.jar MarkDuplicates MAX_FILE_HANDLES_FOR_READ_ENDS_MAP=1000 ...

This is somewhat counter-intuitive as the error is caused by an acute overabundance of file handles, yet my suggested fix is to permit even fewer handles? In this case at least, it appears Picard compensates by creating fewer, larger spill files. You’ll notice I didn’t use the exact value of ulimit -n in the argument; it’s likely there’ll be a few other file handles open here and there (your input, output and metrics file, at least) and so you’ll stumble across the same error once more.
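
The same arithmetic can be done from a wrapper script rather than by eye. This sketch uses Python's resource module (Unix only) to read the soft limit that ulimit -n reports and subtracts a safety margin. The margin of 24 and the function name are arbitrary choices of mine, picked so that a limit of 1024 yields the 1000 used above.

```python
import resource


def safe_read_ends_handles(reserve=24, soft_limit=None):
    """Suggest a MAX_FILE_HANDLES_FOR_READ_ENDS_MAP value below the soft limit.

    `reserve` leaves headroom for the input, output, metrics and any other
    files the process holds open. Returns None if the limit is unlimited.
    """
    if soft_limit is None:
        soft_limit, _hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    if soft_limit == resource.RLIM_INFINITY:
        return None
    # Never suggest fewer than one handle, even on a tiny limit.
    return max(1, soft_limit - reserve)
```

The result can then be formatted into the MarkDuplicates command as MAX_FILE_HANDLES_FOR_READ_ENDS_MAP=<value>.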

From a little searching, it appears that for the most part GATK will open as many files as it wants and if that number is greater than ulimit -n, it will throw a tantrum. Unfortunately, you’re out of luck here for solving the problem on your own. Non-administrative users cannot raise the hard limit on the number of file handles they are permitted to have open, so you’ll need to befriend your system administrator and kindly request that the hard limit for file handles be raised before continuing. Though, the same link does suggest that lowering the number of GATK execution threads can potentially alleviate the issue in some cases.
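
One thing that can be tried before bothering an administrator: the number ulimit -n prints is the soft limit, and a process is normally allowed to raise its own soft limit as far as the hard limit. A minimal sketch, assuming a Unix system and a wrapper script that goes on to launch java as a child process (which inherits the limit); raise_soft_nofile_limit is a name I made up.

```python
import resource


def raise_soft_nofile_limit():
    """Raise this process's soft open-file limit to its hard limit.

    Returns the soft limit in effect afterwards.
    """
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    if soft != hard:
        try:
            resource.setrlimit(resource.RLIMIT_NOFILE, (hard, hard))
        except (ValueError, OSError):
            # Some platforms refuse certain values; keep the old limit.
            pass
    return resource.getrlimit(resource.RLIMIT_NOFILE)[0]
```

If the hard limit itself is too low for the job, this changes nothing and the request to the administrator still stands.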

Your maximum Java heap is also too small

There was a failure because you did not provide enough memory to run this program.  See the -Xmx JVM argument to adjust the maximum heap size provided to Java

GATK has an eating problem: it has no self-restraint when memory is on the table. I’m not sure whether GATK was brought up with many siblings that had to fight for food but it certainly doesn’t help that it is implemented in Java, a language not particularly known for its memory efficiency. When invoked, Java will allocate a heap to pile the many objects it wants to keep around, with a default maximum size that is often around 1GB (it varies with the JVM and the machine). It’s not enough to just tell your job scheduler that you need all of the RAM; you also need to let Java know that it is welcome to expand the heap for dumping genomes beyond the default maximum. Luckily this is quite simple:

java -Xmx<int>G -jar <jar> <tool> ...
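
Note that the JVM expects the size directly after -Xmx with no separator, for example -Xmx4G. If the command is assembled by a script, a small guard like the following sketch can catch a malformed size before the job is queued. heap_flag and gatk_command are hypothetical helper names, and the accepted suffixes are limited to k, m and g for simplicity.

```python
import re


def heap_flag(size):
    """Return a -Xmx flag for a size such as '4G', '512m' or '2048k'."""
    if not re.fullmatch(r"[1-9][0-9]*[kKmMgG]?", size):
        raise ValueError("unrecognised heap size: %r" % (size,))
    # No colon or equals sign: the value follows -Xmx immediately.
    return "-Xmx" + size


def gatk_command(jar, tool, heap="4G", extra_args=()):
    """Build the argument list for running a tool with a larger heap."""
    return ["java", heap_flag(heap), "-jar", jar, tool, *extra_args]
```

The default of 4G here is only a placeholder; the right value depends on your data and on what the scheduler has actually granted the job.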

The MalformedReadFilter has a looser definition of malformed than expected

I’ve previously touched on the discovery that the GATK MalformedReadFilter is much more aggressive than its documentation lets on. The lovely GATK developers have even opened an issue about it after I reported it in their forum.


tl;dr

  • Your BAM files should end in .bam
  • Any FASTA based reference needs both an index (.fai) and dictionary (.dict)
  • Be indexing, always
  • pysam is a pretty nice package for dealing with SAM/BAM files in Python
  • Your temp dir is too small, specify -Djava.io.tmpdir=/path/to/big/disk/ to java when invoking GATK
  • Picard may generously overestimate the number of file handles available
  • GATK is a spoilt child and will have as many file handles as it wants
  • Apply more memory to GATK with java -Xmx<int>G to avoid running out of heap
  • Remember, the MalformedReadFilter is rather aggressive
  • You need a bigger computer

  1. At the time of writing, 8000.

Secure your Six https://samnicholls.net/2015/07/21/secure-six/ Tue, 21 Jul 2015 11:00:22 +0000

As a financially constrained student, like many others, I use apache’s support for Server Name Indication (SNI) to serve multiple SSL domains from one IP. I’m somewhat competent and the setup seems to work for all of my domains. Yet, some time ago I tried to access one of my VirtualHosts from work over SSL and was greeted by a fairly standard “invalid certificate” error. A certificate was produced but not for the correct domain.

I had caused this by accident once before, where during a rushed deployment of new SSL keys following Heartbleed, I was literally serving the wrong SSLCertificateFile to clients for that particular VirtualHost. But after triple checking the configuration stanza, everything seemed to be correct in this instance. What’s more is the site was definitely receiving traffic and I could access it outside of work without error.

I dismissed the problem as a quirk of Sanger’s network which has been known to do funny things with web cache in the past, until a bug report from Germany rolled in. The same website was not accessible from their home ISP on the continent.

“It works for me”, I thought, and clearly for the majority of other users too. I could access my other SSL protected domains from work, and so too could our bug-reporting German counterpart. I put it down to some weird quirk of Germany.

A little while later, I found that this particular website was still inaccessible from work. After forcing a security exemption, the content served was that of the domain the certificate was for1. Now far beyond any reasonable cache time, I figured something must really be wrong.

I scoured access and error logs, trying to find something obvious. I focused on the peculiar nature of how other SSL protected domains worked fine and yet this one did not. I altered the LogFormat to dump more information and finally noticed a discernible difference.

The certificate error only occurred when the client had an IPv6 address.

Bollocks. I’d dun goofed the IPv6 configuration, pretty badly. Whilst the server itself is “IPv6 ready”: it can be pinged and the world’s DNS servers know how to reach it over the protocol, I’d never told apache it is expected to be able to serve content over SSL over IPv6.

After all the investigatory effort, the fix just consisted of a minor update to the apache ports configuration, adding a new NameVirtualHost directive for the server’s IPv6 address on both ports 80 and 443:

[...]
    NameVirtualHost [<IPv6 Address>]:80
    [...]

    <IfModule mod_ssl.c>
        [...]
        NameVirtualHost [<IPv6 Address>]:443
    </IfModule>

…and also to add the IPv6 address2 alongside the IPv4 address in each VirtualHost:

<VirtualHost <IPv4 Address>:443 [<IPv6 Address>]:443>
    # Some configuration...
</VirtualHost>

It works!
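
If the VirtualHost stanzas are generated by a script rather than typed by hand, the bracket rule is easy to get right automatically. This is a small sketch using Python's standard ipaddress module; listener and virtualhost_open_tag are names invented for the example, and the addresses used with it below come from the reserved documentation ranges rather than my server.

```python
import ipaddress


def listener(address, port):
    """Format an address:port pair, wrapping IPv6 addresses in brackets."""
    ip = ipaddress.ip_address(address)
    host = "[%s]" % ip if ip.version == 6 else str(ip)
    return "%s:%d" % (host, port)


def virtualhost_open_tag(addresses, port=443):
    """Build the opening VirtualHost line for one or more addresses."""
    return "<VirtualHost %s>" % " ".join(listener(a, port) for a in addresses)
```

For example, virtualhost_open_tag(["192.0.2.10", "2001:db8::1"]) gives a line listing both the IPv4 and the bracketed IPv6 listener on port 443.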


tl;dr

  • I forgot to configure apache to serve content over IPv6 for SSL traffic, things went wrong.
  • I configured apache with a NameVirtualHost for the server’s IPv6 address and things are no longer wrong.

  1. The typical behaviour of apache not knowing which VirtualHost is supposed to be responding to a request is loading the first one. 
  2. Those square brackets aren’t to be interpreted as “optional”, they are how apache expects an IPv6 address to be formatted.

When `True` is not `True` https://samnicholls.net/2015/06/23/when-true-is-not-true/ Tue, 23 Jun 2015 11:00:53 +0000

Today, whilst continuing development on Goldilocks, I discovered a minor oddity that left me a little confused and bemused before lunch: True did not appear to be True.

Part of Goldilocks’ functionality allows for the filtering of results; users may specify a dictionary of criteria whose keys map to functions to be applied to result sets to perform such filtering. For example, start_lte and start_gte both call the following function (with operand set to -1 and 1, respectively), filtering out regions whose starting base position is less than or equal to, or greater than or equal to, some value:

def __exclude_start(region_dict, operand, position):
    if operand < 0:
        return region_dict["pos_start"] <= position
    elif operand > 0:
        return region_dict["pos_start"] >= position
    return False

Each of these exclusion checking functions is expected to return True if the criterion for exclusion has been met. Following each criterion check on the current region, the following mops up to see whether the comparisons can be aborted early:

[...]
elif name == "start_lte":
    ret = __exclude_start(region_dict, -1, to_apply["start_lte"])
[...]

if use_and:
    # Require all exclusions to be true... 
    if ret is False:
        return False
else:
    if ret is True:
        # If we're not waiting on all conditions, we can exclude on the first
        return True

However, during some testing this morning, I noticed spurious results: regions that I expected to be excluded were not. The test suite confirms. I played around with the direction of my angle brackets and switched around True and False to no avail.

I added a simple print statement to __exclude_start, everything appeared to behave as expected – True and False were being printed as one would expect for each pair of positions. Yet the clean-up if ret is True statement was definitely being “ignored”.

print("%d <= %d: %r" % (region_dict["pos_start"], position, region_dict["pos_start"] <= position))
>>> 1 <= 2: True
>>> 2 <= 1: False

Struggling for ideas I thought: maybe I’m not supposed to be using return like that? I reduced the exclusion testing function and “manually” returned True where necessary:

def __exclude_lte(a, b):
    if a <= b:
        return True
    return False

The test suite passes.

I try something else.

def __exclude_lte(a, b):
    return bool(a <= b)

The test suite passes. What weird funky type magic is happening? I’m pretty certain I’m allowed to use return like this and expressions should be automatically bool anyway?

print("%d <= %d: %r (%s)" % (a, b, a <= b, type(a <= b)))
>>> 1 <= 2: True (<type 'numpy.bool_'>)
>>> 2 <= 1: False (<type 'numpy.bool_'>)

Oh crumbs. Now it all makes sense…

Goldilocks makes extensive use of the numpy package (primarily for its nice arrays) which apparently implements its own boolean type that is returned when forming expressions that involve other numpy types, such as int64. In Python, the is operator checks whether or not two variables point at the same object in memory; it does not check for equality. Of course, here True is not numpy.bool_(True) (see footnote 1), and this is why if ret is True failed and results were not filtered.
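
The distinction can be reproduced without numpy installed. The sketch below uses a hypothetical stand-in class, BoolLike, which behaves like a non-builtin boolean in the only two ways that matter here (truthiness and equality), to show that an identity check against True fails where a plain truthiness test succeeds. It is an illustration of the mechanism, not of numpy's actual implementation.

```python
class BoolLike:
    """Hypothetical stand-in for a non-builtin boolean such as numpy.bool_."""

    def __init__(self, value):
        self._value = bool(value)

    def __bool__(self):
        return self._value

    def __eq__(self, other):
        return self._value == bool(other)


def excluded_by_identity(ret):
    # Only passes for the builtin True singleton itself.
    return ret is True


def excluded_by_truthiness(ret):
    # Passes for anything truthy, whatever its type.
    return bool(ret)


ret = BoolLike(True)
print(excluded_by_identity(ret))    # the identity check rejects the stand-in
print(excluded_by_truthiness(ret))  # the truthiness check accepts it
```

The same reasoning applies to if ret is False: a falsy object that is not the builtin False singleton will never satisfy it.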

Of course, as usual this is all my fault and could have been easily avoided. The anal C programmer in me likes explicit checking of these sorts of things (and is (not) None is a frequent occurrence in my Python scripts) but this whole trouble would have been avoided if I’d just used appropriate Python style and ditched the redundant parts of the clean-up statements anyway:

if use_and:
    # Require all exclusions to be true... 
    if not ret:
        return False
else:
    if ret:
        # If we're not waiting on all conditions, we can exclude on the first
        return True


tl;dr

  • TIL: numpy has its own bool type.
  • One should be careful to remember the difference between testing identity (is) and equality (==) in Python.
  • One should probably be more careful to avoid problems like this in the first place by using the language constructs properly…

  1. Although:

    True is not np.bool(True)
    >>> False

Playing Phylogenetic Hide and Seek with Protozoa https://samnicholls.net/2015/05/18/playing-phylogenetic-hide-seek/ Mon, 18 May 2015 11:00:54 +0000

Amanda suggested that alongside archaeal, bacterial and fungal associated hydrolases, we should also look at protozoans. No problem, I’ll just get the taxonomy ID for protozoa and extract another database from UniProtKB as before. Simple! Or so I thought…

The rabbit hole is pretty deep on this one. Feel free to skip my multi-day exploration into the history of protozoans and their taxonomy and meet me on the other side.

Classification of protozoa appears to be less clear than I had realised. UniProtKB lists only three taxonomy entries for the term:

  • uncultured rumen protozoa
  • uncultured Canadian Arcott wether rumen protozoa
  • uncultured protist

But none of these are really what I’m looking for. UniProtKB uses these pseudo-species as a catch-all for environmental samples that don’t really fit elsewhere in the database, but I want a high level taxonomic rank like a kingdom that encompasses all of the organisms of interest. So what are the organisms of interest? In an early introduction to my project, I described the protozoa as:

[…] single celled micro-organisms that feed from their direct surroundings and have the capacity for controlled movement with a tendency to thrive in moist environments […]

Yet it seems the answer to “What are protozoa?” boils down to who you ask, or rather, whose interpretation of the taxonomic system you ask.

A Very Brief and Biased History of Modern Taxonomy

This section is mostly an attempt to condense J.M. Scamardella’s 1999 paper: Not plants or animals: a brief history of the origin of Kingdoms Protozoa, Protista and Protoctista. I won’t attempt to present a full history and instead summarise the origins surrounding the protozoa. For a full, interesting review I suggest you check it out.

In the 18th century, Carl Linnaeus published works on the biological classification of organisms based on their structural appearances, establishing the three kingdoms of Animale (Animal), Vegetabile (Vegetable) and Lapideum (Mineral) which lend themselves to a guessing game of the same name. Linnaeus’ work, particularly on naming strategies for organisms served as a foundation to modern taxonomy.

However, the class system set out by Linnaeus was designed for identification rather than heritage. Following Charles Darwin’s revolutionary On the Origin of Species a century later in 1859, tree representations of species sharing common descent became increasingly common (although his tree diagrams were by no means the first1) and as evolutionary theory developed and took hold, scientists attempted to group available fossil records by structural affinity to link ancestry of common species through the ages. It was here we drew the comparison between contemporary bird species and the dinosaurs.

Meanwhile, micro-organisms exhibiting a variety of characteristics with affinity to both known plants and animals were confusing the topic of taxonomy further…

All Hail the Mighty Kingdom of Protista

Before Darwin, in 1820, the first use of Protozoa appeared in literature from Georg Goldfuss as a class for “first, or early animals” inside the established kingdom of Animalia. Other scientists believed that the protozoa were in fact a phylum, or even a kingdom of their own2.

Just a few years after Darwin’s publication and heavily influenced by evolutionary theory, in 1866 Ernst Haeckel proposed the addition of a third kingdom3, the kingdom of Protista. Not seeing a reason for organisms to be binarily classified as plant or animal, Haeckel hypothesized that organisms which cannot be classed as plant or animal “without manifest coercion” must “have evolved independent of the lineages of the animal and plant kingdoms”4.

Thus, the kingdom of Protista was proposed to contain “doubtful organisms of the lowest rank which display no decided affinities nearer to one side [animals] than to the other [plants]” or organisms possessing “animal and vegetable characters united and mixed”5. Haeckel proposed this as a “kingdom of primitive forms” and included the “Monera” (bacteria) as members of this kingdom too.

It’s clear how such kingdoms came to be described in 21st century literature as “a grab-bag for all eukaryotes that are not animals, plants or fungi”6. These early classifications appeared to focus on removing contradicting taxa that were clouding “pure” definitions of what was truly plant and animal, conveniently classifying difficult organisms by what they are not, as opposed to what they are7. The term became a dumping ground for unicellular protozoa that were not quite animals, protophytic algae that were not quite plants and fungal organisms such as slime molds that appeared to be somewhat both (it wouldn’t be until 1969 that Fungi would get a kingdom of their own).

After the invention of the first electron microscope in the early 20th century, confusion was further compounded by the discovery of a distinct cellular nucleus in some unicellular organisms.

The Protozoan Identity Crisis

Although Haeckel later refined his model, adding the now familiar term protozoa as a “sub-kingdom” to his Protista, to represent unicellular animals, another kingdom schema emerged in 1938. Proposed by Herbert Copeland in The Kingdoms of Organisms8, Copeland moved the bacteria and algae out of Haeckel’s Protista kingdom to create a new kingdom: Monera.

Copeland would later re-name his kingdom of Protista to Protoctista (“first established beings”), a term originally coined by John Hogg. As Haeckel’s model still persisted, with his Protista kingdom still containing bacteria, Copeland deemed it “unfit” to continue using the same name in his model. Selecting a new name, Copeland purposefully avoided the term Protozoa, both because of its confusing prior use as a kingdom, class and phylum, and in agreement with a point made in Hogg’s 1860 manuscript On the Distinctions of a Plant and an Animal and on a Fourth Kingdom of Nature: that the term “can alone include those that are admitted by all to be animals or ‘zoa’”9, i.e. the term is inappropriate if it is to be applied to non-animals.

The Rise of Superkingdoms

In the 1960s, microbiologist Roger Stanier propagated an observed “fundamental division of life” between the “prokaryotes” and “eukaryotes”, originally noted decades prior by Edouard Chatton in 1925 (footnote 10).

Robert Whittaker published his own five kingdom model in 1969 (a revision of his earlier work where he had in fact returned the bacteria to the Protista kingdom, it would take a few more years of research before concluding “this evolutionary divergence in cellular structure [in bacteria] had to be accounted [for]”11), but placing the kingdom Monera beneath a new Superkingdom of Bacteria (placing all the other kingdoms under the new Eukaryotic Superkingdom) and introducing the kingdom of Fungi; a wholly distinct kingdom from Plantae. Whittaker maintained that unicellularity was the most important characteristic for deciding membership of the kingdom of Protista12.

Around the same time biologist Lynn Margulis began working on her own schema for organising life. After several iterations, Margulis relied more on morphologic and structural observations over Whittaker’s strict unicellular criteria and allowed her Protista (later re-named to Protoctista as per Copeland’s model) to contain eukaryotic organisms that were “either unicellular or multicellular that are not plants, animals or fungi”[Emphasis mine]13.

For those still playing along at home, Whittaker’s five-kingdom re-organisation in 1969 left us with this:

Superkingdom | Kingdom | Description | Examples14
Prokaryota | Monera | “Procaryotic cells, lacking nuclear membranes, plastids, mitochondria, and advanced […] flagella”15 | Blue-green algae (Cyanophyta), gliding bacteria (Myxobacteriae), “true” bacteria (Eubacteriae)
Eukaryota | Fungi | Whittaker’s kingdom established as a rejection of “the superficial resemblance of fungi to plants”16 and the observation that nutrition is derived from environmental absorption17 | Slime molds, species demonstrating sporing
Eukaryota | Protista | “Primarily unicellular or colonial-unicellular organisms […] with eucaryotic cells”18 | Ciliophora, Sarcodina, Sporozoa, Euglenophyta
Eukaryota | Plantae | “Multicellular organisms with walled and frequently vacuolate eucaryotic cells and with photosynthetic pigments in plastids”19 | Algae (including red and brown, but in different subkingdoms)
Eukaryota | Animalia | “Multicellular organisms with wall-less eucaryotic cells lacking plastids and photosynthetic pigments. Nutrition primarily ingestive with digestion in an internal cavity”20 | Everything else…

In the early 1980s, John Corliss reviewed the work of both Margulis and Whittaker and tried to solve the question of cellular complexity by instead counting the number of “differentiated, functional tissues” an organism exhibits, rather than just the boolean question of whether or not they are unicellular. Corliss describes plants and animals as having more than one type of tissue, whereas “protists, while showing multicellularity to varying degrees in certain groups […] fail to demonstrate the organization of cells into two or more clearly differentiated tissues”. Though, a criticism of this model is it “overlooks the fact that multicellular, differentiated organisms are known in all four eukaryotic kingdoms” such as cyanobacteria21.

In 1989, Michael Sleigh, in his second edition of Protozoa and other Protists opened the introduction with:

The position [regarding the origins of eukaryotes from prokaryotes] is now clearer, and there is much support for the view that eukaryotes are best divided into four kingdoms: Animalia or multicellular animals […], Plantae or green land plants […], Fungi […] and Protista, comprising eukaryote groups formerly classified as algae, protozoa and flagellate fungi.

— Michael Sleigh, Protozoa and other Protists (2nd ed.), 1989.

It appears that the scientific community were converging on an agreement that establishment of another kingdom was necessary. But there were different arguments as to how to organise the members of each kingdom and what criteria should be applied for classification.

Yet there was even a difference of opinion between Corliss and Copeland with respect to why a kingdom for protists should exist. Corliss believed that a Protista kingdom should exist only if “major uniqueness” can be determined, rather than classification based on the absence of functions common in other kingdoms. Copeland argued that a shared lack of features or function is “not a detriment to classification as a coherent grouping”22.

Confusingly, Corliss advocated usage of the term protist to refer to any Protista, regardless of cellularity. Yet the term protist was originally defined by C. Clifford Dobell in 1911 to specifically refer to Protista demonstrating a “unicellular type of organization”23. These sorts of disagreements only further cloud my understanding of what the protozoa actually are.

A Very-Very Brief Introduction to Current Taxonomy

Domains and the Sequencing Revolution

With the development of chain-termination DNA sequencing in 1977 by Frederick Sanger, the field of molecular genetics was born. Taxonomy could now be based on differences expressed in the genetic sequences of organisms (particularly in highly conserved subsequences, such as those responsible for constructing ribosomal RNA molecules), rather than by subjective observations of morphology and function.

Indeed in 1990, Carl Woese described “textbook” definitions of the “basic organisation of life” reliant on classical phenotyping as “outmoded” and “misleading”24. Referring to a result from Zuckerkandl and Pauling, Woese states it is clear that “it is at the level of molecules (particularly molecular sequences) that one really becomes privy to the workings of the evolutionary process”, such molecular methods reveal relationships that just cannot be inferred from an organism’s appearance or function25. Certainly there have been many analyses of ribosomal RNA that provide clear evidence for phylogenetic separation of eubacterial, archaebacterial, and eukaryotic organisms.

Woese argued that a phylogenetic system must “first and foremost recognize the primacy of the three groupings, eubacteria, and archaebacteria and eukaryotes”26 above the conventional five kingdoms that had developed over the past few decades that fail to represent an accurate view of the evolutionary relationship between the kingdoms. It was here that Woese proposed a radical change to taxonomy and added the taxonomic rank of Domain, superior to Kingdom.

Whilst Whittaker had previously introduced the concept of a Superkingdom to differentiate between the Monera and the rest of the kingdoms, this distinction merely noted a difference between “bacteria” and “everything else” and didn’t describe the relationship between the kingdoms within. And so, in his 1990 paper: Towards a natural system of organisms: Proposal for the domains Archaea, Bacteria, and Eucarya, Woese introduced the three domains of life27:

[Figure: woese-4578-tree, the three-domain tree from Woese et al., 1990]

Woese didn’t make suggestions for what kingdoms should exist beneath these domains (other than an outline for the archaea, which I will ignore here) anticipating that “analysis of the Eucarya will preserve the kingdoms Plantae, Animalia, and Fungi […] and will replace Protista with a series of kingdoms corresponding to the various ancient protistan lineages”.

The Rise and Fall of the Nine Kingdoms of Eukaryota

Nine Kingdoms for the eukaryotes,
Eighteen Phyla for the protists,
But we were all deceived, for a definition of the protozoa, still could not be agreed.

In 1981, a decade before the Domain, Thomas Cavalier-Smith published his own kingdom model, initially sub-dividing the eukaryotes into nine different groups and breaking Whittaker’s Protista kingdom into three: the Archezoa (without mitochondria), Chromista (featuring chloroplasts with particular chlorophyll in the lumen) and the Protozoa, defined by their phagotrophism28 but still seeming somewhat like a bin for whatever couldn’t fit elsewhere. Just two years later, Cavalier-Smith culled his eukaryotic kingdoms to the Animalia, Plantae, Fungi, Chromista and Protozoa (which now contained the relegated subkingdom Archezoa). Bacteria still had their own kingdom but now also contained the archaebacteria29.

In the following decades, Cavalier-Smith would revise his system multiple times as more genetic sequencing data became available for analysis. By 1993, Kingdom Protozoa had 18 phyla30 and as recently as 2010, Cavalier-Smith significantly revised the ordering of subkingdoms between the Protozoa and Chromista. As you’ll see below, Cavalier-Smith’s work serves as a significant foundation of the organisation of taxonomy today.

Getting the Band Back Together: 21st Century Supergroups

With continuing innovation fueling rapidly moving research, especially in the fields of molecular biology and genetics, more sequences could be analysed than ever before. Scientists had been won over by Woese’s domain model and tried to keep an underlying kingdom structure that preserved the integrity of the “tree” model, ensuring branches are monophyletic (i.e. each branch contains a common ancestor and all of its descendants, and nothing else).
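To make the idea of a monophyletic branch concrete, here is a minimal sketch in Python. The tree is a toy child-to-parent mapping invented purely for illustration (it is not a claim about real phylogeny), and the function names are hypothetical. A set of taxa counts as monophyletic here if it is exactly the set of leaves beneath the taxa’s most recent common ancestor.

```python
# Toy tree as a child -> parent mapping. The topology is illustrative only.
PARENT = {
    "Animalia": "Opisthokonta",
    "Fungi": "Opisthokonta",
    "Opisthokonta": "Eukaryota",
    "Plantae": "Eukaryota",
    "Eukaryota": "Root",
    "Bacteria": "Root",
}


def ancestors(node):
    """Return the path from a node up to the root, starting with the node."""
    path = [node]
    while node in PARENT:
        node = PARENT[node]
        path.append(node)
    return path


def leaves_under(node):
    """Return the set of leaf taxa that descend from a node."""
    children = [child for child, parent in PARENT.items() if parent == node]
    if not children:
        return {node}
    leaves = set()
    for child in children:
        leaves |= leaves_under(child)
    return leaves


def most_recent_common_ancestor(taxa):
    """Return the deepest node that lies on every taxon's path to the root."""
    paths = [ancestors(taxon) for taxon in taxa]
    shared = set(paths[0]).intersection(*paths[1:])
    # Paths run leaf -> root, so the first shared node is the most recent one.
    return next(node for node in paths[0] if node in shared)


def is_monophyletic(taxa):
    """True if the taxa are exactly the leaves beneath their common ancestor."""
    taxa = set(taxa)
    return leaves_under(most_recent_common_ancestor(taxa)) == taxa
```

In this toy tree, grouping Animalia with Fungi is monophyletic, whereas grouping Animalia with Plantae is not, because their common ancestor also gave rise to Fungi.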

In 2004, Simpson and Roger published a review article outlining six Supergroups, which I have attempted to condense into the table below31. If, like me, however, you prefer pretty colour-coordinated diagrams, refer to the figure below the table instead.

| Domain | Group | Description32 | Examples33 |
| --- | --- | --- | --- |
| Bacteria | Bacteria | Prokaryotic cells lacking membrane-bound organelles and a nucleus | |
| Archaea | Archaea | Like bacteria but demonstrating more complex RNA polymerases than bacteria (similar to eukaryotes); peptidoglycan does not appear in the cell wall; often appear in extreme environments (e.g. acidophiles, halophiles, hyperthermophiles); methanogens are classified as archaea | |
| Eukaryota | Opisthokonta (Cavalier-Smith, 1987) | Primarily predatory multicellular organisms | “animals and true fungi as well as several unicellular groups, including the free-living choanoflagellates” |
| Eukaryota | Amoebozoa (Cavalier-Smith, 1998) | “Most of the cells that move and feed using broad or finger-like pseudopodia” (“false feet”: temporary microtubule and filament structures), typically “heterotrophs that engulf other cells using their pseudopodia” | Classical amoebae and slime moulds |
| Eukaryota | Excavata (Cavalier-Smith, 2002) | “unicellular eukaryotes, most of which are heterotrophic flagellates”, “[m]any excavates have greatly modified mitochondria that are not used for oxidative phosphorylation” | Various groups of parasitic flagellate protozoa |
| Eukaryota | Rhizaria (Cavalier-Smith, 2002) | “unites a wide diversity of free-living unicellular organisms, many of which feed using fine ‘filose’ pseudopodia, together with some fungi-like plant parasites” | Protist groups including foraminifera (mostly marine-based shell-building amoeboid protists), radiolaria (ocean-based protozoa with mineral-based skeletal structures) and “heterotrophic flagellates or amoebae that consume other microbes associated with surfaces” |
| Eukaryota | Chromalveolata | Organisms created by secondary endosymbiosis (a eukaryote engulfs and enslaves another eukaryote containing a primary plastid) of red algal origin | dinoflagellates (flagellate protists, mostly marine plankton), cryptophytes (freshwater algae with plastids), haptophytes and stramenopiles (a.k.a. heterokonts: algae, giant kelp, diatoms, plankton), alveolates (major grouping of protozoa) and apicomplexa (specialist parasites including plasmodia (causing malaria), toxoplasma, and cryptosporidium) |
| Eukaryota | Archaeplastida (Plantae) | Organisms featuring “plastids (chloroplasts) that originated by primary endosymbiosis” (enslavement and genomic reduction of a prokaryotic cell) | Land plants, red and green algae and rare microscopic algae called glaucophytes |

[Figure: simpson-rogers-cb-r964, diagram of the eukaryotic supergroups from Simpson and Roger, 2004]

Yet, even now there is no consensus on what supergroups definitely exist, how exactly they are related and what species belong where. As recently as 2007, Burki et al. proposed the SAR Supergroup (which as of this year is now Subkingdom Harosa34), suggesting the Stramenopiles, Alveolata and Rhizaria should be organised beneath one branch together.

Woop Woop. I’m done.

A later paper by Burki introduces the term Megagroup35. At this point, I realised I’ve spent the best part of a week digging for an answer to my classification conundrum that likely does not exist. This is yet another side-quest or interesting avenue that I must turn back on to get back to the research that I should be doing…

Ready or not, here I come!

UniProtKB

That was a lovely, confusing and borderline frustrating insight into the state of taxonomy, Sam, but how does this help us get the protozoan-associated hydrolases? Who sets the taxonomic standard for TrEMBL and SwissProt? UniProt’s help documentation quickly offloads the responsibility to the NCBI:

The taxonomy database that is maintained by the UniProt group is based on the NCBI taxonomy database, which is supplemented with data specific to the UniProt Knowledgebase (UniProtKB). While the NCBI taxonomy is updated daily to be in sync with GenBank/EMBL-Bank/DDBJ […]

According to Chapter 4 of the NCBI Handbook, the NCBI Taxonomy Project began in 1991 “to combine the many taxonomies that existed at the time into a single classification that would span all of the organisms represented in any of the GenBank sources databases”. The idea was to unify the terminology in use, solving the problem of duplicated, conflicting and incomparable taxonomies maintained by the three big sequencing databases (GenBank, EMBL and DDBJ), each of which was keeping its own modified classification system originally derived from Los Alamos National Lab.

Six years of curation later, both EMBL and DDBJ switched to adopt the NCBI taxonomic standard; SwissProt followed suit in 2001.

NCBI’s 21 Structures of the Eukaryotes

So, let’s take a look at NCBI’s taxonomy tree. As of May 2015, the following appear as entries beneath the Eukaryota domain (or superkingdom36, or rather top-level group37):

| Name | Rank†38 | Unauthoritative Supergroup‡ |
| --- | --- | --- |
| Alveolata (alveolates) | (Superphylum) | Chromalveolata |
| Amoebozoa | (Phylum) | =Amoebozoa |
| Apusozoa | (Subphylum) | ?39 |
| Breviatea | (Class) | Amoebozoa |
| Centroheliozoa (centrohelids) | (Class Centrohelea) | ?40 |
| Cryptophyta (cryptomonads) | CLASS (Class Cryptophyceae) | Chromalveolata |
| Euglenozoa | (Infrakingdom) | Excavata |
| Fornicata | (?) | Excavata |
| Glaucocystophyceae (glaucocystophytes) | CLASS (Class Glaucophyceae) | Archaeplastida |
| Haptophyceae (coccolithophorids) | (Phylum Haptophyta == Class Coccolithophyceae == Class Prymnesiophyceae) | Chromalveolata |
| Heterolobosea | CLASS | Excavata |
| Jakobida | (Order) | Excavata |
| Katablepharidophyta | CLASS (Order Katablepharida) | Chromalveolata |
| Malawimonadidae | FAMILY | Excavata |
| Opisthokonta | (Supergroup) | =Opisthokonta |
| Oxymonadida (oxymonads) | ORDER | Excavata |
| Parabasalia (parabasalids) | (?) | Excavata |
| Rhizaria | (Infrakingdom) | =Rhizaria |
| Rhodophyta (red algae) | (Phylum) | Archaeplastida |
| Stramenopiles (heterokonts) | (Superphylum Heterokonta == Supergroup Stramenopiles) | Chromalveolata |
| Viridiplantae (green plants) | KINGDOM (Subkingdom) | Archaeplastida |
| environmental samples | N/A | N/A |
| unclassified eukaryotes | N/A | N/A |

† Parentheses indicate the entry was unranked by UniProtKB and a rank was interpreted from another source41.
‡ There are no references for these classifications; I’ve just tried to make sense of where the most likely supergroup for each taxonomic entry lies. ? indicates Simpson and Roger were unable to fit this entry in their 2004 model.

Confusingly, most of the 21 entries are not even of the same taxonomic rank. The reasoning for so many “unorganised” entries is lost on me and I can’t seem to locate more information on the methods used to organise taxonomy from NCBI. For most of the entries, UniProt’s corresponding metadata does not even contain a taxonomic rank. I’ve thus lazily inferred missing ranks (in parentheses) from WikiSpecies42 and Ruggiero et al.43. Regardless, the footer of every page of the NCBI taxonomy database is adorned with:

Disclaimer: The NCBI taxonomy database is not an authoritative source for nomenclature or classification – please consult the relevant scientific literature for the most reliable information.

So perhaps any suggestion that the entries provide anything beyond an agreed set of boxes in which to place organisms in a database is invalid. Though the help pages confirm the schema attempts to maintain a phylogenetic taxonomy, my impression is that the list purposefully contains “lower” ranks such as Class, Order and even Family to avoid having to keep up with frequent re-shuffles and re-naming of the (Infra-, Sub-, Super-) Groups, Kingdoms and Phyla, and that some entries have been left broken as a result of changes that have already happened. For example, I can’t find a mention of either the Fornicata or Parabasalia in Ruggiero et al., but one cannot simply expunge records from a widely used taxonomy database without trouble44!

So… What are protozoa?

I… I don’t know.

I think it is safe to conclude that the protozoa (or protozoans) consist of an incredibly diverse group of various micro-organisms. There appears to be wide-ranging evidence to support phylogenetic separation of protozoans into distinct groups (with whatever taxonomic rank one wishes to use), each exhibiting different evolutionary variation in their genomes. Whilst the term appears to be valid and widely accepted for unicellular eukaryotes, given the current state of their broken-up taxonomy (and the ambiguous meaning of the term in the past), I would discourage its use when it is possible to substitute a more specialised term for the organisms of interest.

For example, ciliates have their own phylum (Phylum Ciliophora45) and a quick search for Ciliophora-associated hydrolases returns 1,691 records.

…and where the hell are the viruses?

Already far outside the scope of what I wanted to read up on, I’ll save you and myself the history of viral classification schemes and the varying opinions still around today. Suffice to say this is, unsurprisingly, another topic on which the community at large has not reached an agreement. Viruses are typically excluded from the “tree of life”; to many they are “non-cellular life” (often justified by their inability to replicate on their own, as they must hijack a cell’s machinery to do it for them) and thus have no place in the tree. Some argue that, as evidence points to viruses co-existing with organisms throughout evolutionary history, they deserve representation as a supergroup or domain of their own46. Indeed, viruses are a significant contributor to genetic diversity via their role in horizontal gene transfer (transferring genes from one organism to another, regardless of species, without ‘traditional’ reproduction).

Ultimately, whether a virus is truly alive or not is somebody else’s problem; viruses still possess sequence data and that has to be organised somewhere. The NCBI database has worked around this by adding a top-level taxonomic entry for “Viruses” with a structure beneath that appears to follow or emulate the Baltimore Classification system, where viruses are organised into one of seven classes based on how they store their genome and their mode of transcription.
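For reference, the seven Baltimore classes can be written down as a small lookup. The sketch below uses the standard textbook definitions of the classes; it is not taken from the NCBI database itself, and the function name and argument names are my own invention for illustration.

```python
# The seven Baltimore classes, keyed by their conventional Roman numerals.
BALTIMORE_CLASSES = {
    "I": "double-stranded DNA",
    "II": "single-stranded DNA",
    "III": "double-stranded RNA",
    "IV": "positive-sense single-stranded RNA",
    "V": "negative-sense single-stranded RNA",
    "VI": "single-stranded RNA, reverse-transcribing (DNA intermediate)",
    "VII": "double-stranded DNA, reverse-transcribing (RNA intermediate)",
}


def baltimore_class(nucleic_acid, strands, sense=None, reverse_transcribing=False):
    """Map a genome description onto a Baltimore class numeral.

    nucleic_acid: "DNA" or "RNA"; strands: 1 or 2;
    sense: "+" or "-" (only consulted for single-stranded RNA genomes).
    """
    if strands not in (1, 2):
        raise ValueError("strands must be 1 or 2")
    if nucleic_acid == "DNA":
        if strands == 2:
            return "VII" if reverse_transcribing else "I"
        return "II"
    if nucleic_acid == "RNA":
        if strands == 2:
            return "III"
        if reverse_transcribing:
            return "VI"
        if sense == "+":
            return "IV"
        if sense == "-":
            return "V"
        raise ValueError("single-stranded RNA genomes need a sense of '+' or '-'")
    raise ValueError("nucleic_acid must be 'DNA' or 'RNA'")
```

A retrovirus, for instance, would be described as `baltimore_class("RNA", 1, reverse_transcribing=True)` and lands in class VI.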

Conclusion

The protozoa have had a long and muddled history and my bet is they will continue to do so as further comparative phylogenomics studies are completed. It’s worth noting that phylogenomics is still a field of scientific endeavour in itself, evolving and adapting to new evidence as other fields (including my own) move alongside it. For the purpose of getting on with the study of our own fields it may have been necessary to find a compromise between having the most up-to-date database and having a database that stays constant enough to know where things actually are.

Indeed, on the subject of compromise, Ruggiero et al. describe their work as a “consensus classification”; a “neither phylogenetic nor evolutionary” system that makes “practical compromises” to accommodate many wide-ranging opinions and available evidence, primarily with the goal of providing a “backbone” for databases and collections47.

In the end, I guess expecting the NCBI (or anyone else) to hold a concrete consensus on the organisation of life was naive and possibly reflects the fact that I’m still more of a Computer Scientist than Microbiologist. Here was me thinking we’d pretty much classified all of life already!


tl;dr

  • Protozoa appears to have become a somewhat ambiguous and confusing term over the past few centuries and is applied to a vast number of different species which have now been split across the tree of life.
  • I need to go back to the original sample and find out what “protozoa” it might contain to try and conduct more specific searches.
  • Even today, we still can’t agree on what to call things and where they belong in a taxonomy, or even how best to present that taxonomy. But that’s just how science works.

“No mouth. No respiration. No entry.”

— Kingdom Animalia Clubhouse Rules, 1860.

6


  1. Mark A. Ragan, Trees and networks before and after Darwin, 2009. 
  2. JM Scamardella, Not plants or animals: a brief history of the origin of Kingdoms Protozoa, Protista and Protoctista, 1999. 
  3. Haeckel did not recognise Linnaeus’ “mineral” kingdom as a kingdom of life. 
  4. M Ragan, A third kingdom of eukaryotic life: history of an idea, 1997. 
  5. JM Scamardella, Not plants or animals: a brief history of the origin of Kingdoms Protozoa, Protista and Protoctista, 1999. 
  6. AG Simpson and AJ Roger, The real ‘kingdoms’ of eukaryotes, 2004. 
  7. JM Scamardella, Not plants or animals: a brief history of the origin of Kingdoms Protozoa, Protista and Protoctista, 1999. 
  8. JM Scamardella, Not plants or animals: a brief history of the origin of Kingdoms Protozoa, Protista and Protoctista, 1999. 
  9. JM Scamardella, Not plants or animals: a brief history of the origin of Kingdoms Protozoa, Protista and Protoctista, 1999. 
  10. JM Scamardella, Not plants or animals: a brief history of the origin of Kingdoms Protozoa, Protista and Protoctista, 1999. 
  11. JM Scamardella, Not plants or animals: a brief history of the origin of Kingdoms Protozoa, Protista and Protoctista, 1999. 
  12. JM Scamardella, Not plants or animals: a brief history of the origin of Kingdoms Protozoa, Protista and Protoctista, 1999. 
  13. RH Whittaker, New concepts of kingdoms of organisms, 1969. 
  14. RH Whittaker, New concepts of kingdoms of organisms, 1969. 
  15. JM Scamardella, Not plants or animals: a brief history of the origin of Kingdoms Protozoa, Protista and Protoctista, 1999. 
  16. RH Whittaker, New concepts of kingdoms of organisms, 1969. 
  17. RH Whittaker, New concepts of kingdoms of organisms, 1969. 
  18. RH Whittaker, New concepts of kingdoms of organisms, 1969. 
  19. RH Whittaker, New concepts of kingdoms of organisms, 1969. 
  20. JM Scamardella, Not plants or animals: a brief history of the origin of Kingdoms Protozoa, Protista and Protoctista, 1999. 
  21. JM Scamardella, Not plants or animals: a brief history of the origin of Kingdoms Protozoa, Protista and Protoctista, 1999. 
  22. JM Scamardella, Not plants or animals: a brief history of the origin of Kingdoms Protozoa, Protista and Protoctista, 1999. 
  23. Carl R. Woese, Otto Kandler and Mark L. Wheelis, Towards a natural system of organisms: Proposal for the domains Archaea, Bacteria, and Eucarya, 1990. 
  24. Carl R. Woese, Otto Kandler and Mark L. Wheelis, Towards a natural system of organisms: Proposal for the domains Archaea, Bacteria, and Eucarya, 1990. 
  25. Carl R. Woese, Otto Kandler and Mark L. Wheelis, Towards a natural system of organisms: Proposal for the domains Archaea, Bacteria, and Eucarya, 1990. 
  26. Carl R. Woese, Otto Kandler and Mark L. Wheelis, Towards a natural system of organisms: Proposal for the domains Archaea, Bacteria, and Eucarya, 1990. 
  27. T Cavalier-Smith, Eukaryote kingdoms: Seven or nine?, 1981. [Abstract only]. 
  28. T Cavalier-Smith, A revised six-kingdom system of life, 1998. 
  29. T Cavalier-Smith, Kingdom Protozoa and Its 18 Phyla, 1993. 
  30. AG Simpson and AJ Roger, The real ‘kingdoms’ of eukaryotes, 2004. 
  31. AG Simpson and AJ Roger, The real ‘kingdoms’ of eukaryotes, 2004. 
  32. AG Simpson and AJ Roger, The real ‘kingdoms’ of eukaryotes, 2004. 
  33. Michael Ruggiero et al., A Higher Level Classification of All Living Organisms, 2015. 
  34. Fabien Burki et al., Phylogenomics Reshuffles the Eukaryotic Supergroups, 2007. 
  35. I’m unclear on whether the NCBI have even officially adopted either of these two terms. 
  36. With “supergroups” and now “megagroups” being used in modern literature, I should have trodden carefully even with the word “group”. But it’s too late for that now. 
  37. Fabien Burki et al., Phylogenomics Reshuffles the Eukaryotic Supergroups, 2007. 
  38. AG Simpson and AJ Roger, The real ‘kingdoms’ of eukaryotes, 2004. 
  39. AG Simpson and AJ Roger, The real ‘kingdoms’ of eukaryotes, 2004. 
  40. Fabien Burki et al., Phylogenomics Reshuffles the Eukaryotic Supergroups, 2007. 
  41. I’m sorry. 
  42. Michael Ruggiero et al., A Higher Level Classification of All Living Organisms, 2015. 
  43. Let’s not forget the confusion that happened last time a database removed a significant number of records without obvious warning. 
  44. Fabien Burki et al., Phylogenomics Reshuffles the Eukaryotic Supergroups, 2007. 
  45. Arshan Nasir, Kyung Mo Kim, and Gustavo Caetano-Anolles, Giant viruses coexisted with the cellular ancestors and represent a distinct supergroup along with superkingdoms Archaea, Bacteria and Eukarya, 2012. 
  46. Fabien Burki et al., Phylogenomics Reshuffles the Eukaryotic Supergroups, 2007. 
]]>
https://samnicholls.net/2015/05/18/playing-phylogenetic-hide-seek/feed/ 0 213
TrEMBLing https://samnicholls.net/2015/04/24/trembling/ https://samnicholls.net/2015/04/24/trembling/#respond Fri, 24 Apr 2015 10:13:21 +0000 http://blog.ironowl.io/?p=261 Something appears amiss with TrEMBL, millions of sequences are “missing”. Where did they go?

At the end of last month, to build a database of bacterial sequences with known hydrolase activity1, I extracted around 2.9 million sequences from UniProtKB/TrEMBL; a popular database which contains sequences that have been automatically annotated and are awaiting manual curation for graduation to the UniProtKB/SwissProt database. It’s important to note that as these annotations have not yet been reviewed they may be less accurate, but it is this same lack of review that allows the database to be so large — making it a useful first port of call when trying to classify your own sequences. Typically we handle the potential for less accurate results by using a more stringent quality threshold (than we would for a manually curated database such as SwissProt, or one we have created in confidence) when filtering alignment hits from software such as BLAST.
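As a rough illustration of that last point, the sketch below filters BLAST hits in the standard 12-column tabular format (`-outfmt 6`) using a stricter cut-off for the unreviewed database. The threshold values and the function name are placeholders I have made up for the example, not the values used in this work.

```python
import csv

# Default column order of BLAST+ tabular output (-outfmt 6).
FIELDS = ["qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
          "qstart", "qend", "sstart", "send", "evalue", "bitscore"]

# Hypothetical thresholds: stricter for the automatically annotated database.
THRESHOLDS = {
    "swissprot": {"max_evalue": 1e-5, "min_pident": 30.0},
    "trembl": {"max_evalue": 1e-10, "min_pident": 50.0},
}


def filter_hits(handle, database):
    """Yield hits from a tabular BLAST file that pass the database's thresholds."""
    limits = THRESHOLDS[database]
    reader = csv.DictReader(handle, fieldnames=FIELDS, delimiter="\t")
    for hit in reader:
        good_evalue = float(hit["evalue"]) <= limits["max_evalue"]
        good_identity = float(hit["pident"]) >= limits["min_pident"]
        if good_evalue and good_identity:
            yield hit
```

Usage would be along the lines of `with open("hits.tsv") as fh: kept = list(filter_hits(fh, "trembl"))`, where `hits.tsv` is a stand-in file name.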

It’s good to keep databases up-to-date and so I ran the same query2 against TrEMBL with a view to re-download the resulting FASTA, only to find just shy of 1 million results had been returned — just over a third of the original query a month ago. Wat?

TrEMBL was most recently updated at the start of April3 and the current release notes include the graph below.

[Figure: entries, the graph from the TrEMBL release notes]

Indeed, it appears that half of TrEMBL is missing. After my initial panic, I presumed this must have been a database spring clean to remove similar-looking sequences, and after some digging around the FTP repository my hunch was confirmed in an additional news file:

The UniProt Knowledgebase (UniProtKB) has witnessed an exponential growth in the last few years with a two-fold increase in the number of entries in 2014. This follows the vastly increased submission of multiple genomes for the same or closely related organisms. This increase has been accompanied by a high level of redundancy in UniProtKB/TrEMBL and many sequences are over-represented in the database. […] …we have developed a procedure to identify highly redundant proteomes within species groups using a combination of manual and automatic methods. We have applied this procedure to bacterial proteomes (which constituted 81% of UniProtKB/TrEMBL in release 2015_03) and sequences corresponding to redundant proteomes (47 million entries) have been removed from UniProtKB. […] From now on, we will no longer create new UniProtKB/TrEMBL records for proteomes identified as redundant.

Personally I would have liked to see this sort of major announcement (and actually a bit more information on “the procedure”) in the release notes4 rather than as an aside stored in an HTML file that I wouldn’t open in my terminal. Though it is amusing that the removal of 47 million entries still wasn’t enough of a story to make it the “Headline” piece of news for the release!

Mystery solved. At least my BLAST jobs have less to hit against now5?


Update


tl;dr

  • TrEMBL v2015_04 features 47 million fewer sequences than v2015_03, to reduce unnecessary redundancy.

[msn@bert databases]$ python3 ~/scripts/summary_stat_fasta.py uniprot-2015_03_27-ec3-tax2_bacteria.fasta
#NO 2,900,509
MAX 12,374
MIN 8
AVG 362.95
#NT 1,052,747,787
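The script that produced the summary above isn’t included in this post. As a hedged sketch of how such figures could be computed, the following reads sequence lengths from a FASTA file and reports the same five fields. The function names are hypothetical, and I am assuming AVG is simply #NT divided by #NO, which is consistent with the numbers shown.

```python
def read_fasta_lengths(handle):
    """Yield the length of each sequence in a FASTA-formatted file handle."""
    length = None
    for line in handle:
        line = line.strip()
        if line.startswith(">"):
            # A new header closes off the previous record, if there was one.
            if length is not None:
                yield length
            length = 0
        elif line and length is not None:
            length += len(line)
    if length is not None:
        yield length


def summarise(lengths):
    """Return count, longest, shortest, mean and total of the given lengths."""
    lengths = list(lengths)
    if not lengths:
        raise ValueError("no sequences found")
    total = sum(lengths)
    return {
        "#NO": len(lengths),
        "MAX": max(lengths),
        "MIN": min(lengths),
        "AVG": round(total / len(lengths), 2),
        "#NT": total,
    }


def format_summary(stats):
    """Format the summary one field per line, with thousands separators."""
    return [f"{name} {value:,}" for name, value in stats.items()]
```

To use it on a real file, something like `with open(path) as fh: print("\n".join(format_summary(summarise(read_fasta_lengths(fh)))))` would do, with `path` pointing at the FASTA file.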


  1. Sequences with an EC Number (Enzyme Commission number) of 3.* restricted to the taxon Bacteria [2], where 2 is the NCBI taxonomy ID for Bacteria. 
  2. The reviewed:no search query limits results from the UniProtKB to just entries found in TrEMBL. 
  3. And I figured this would make a pretty poor April Fool’s joke. 
  4. A large neon sign wouldn’t have gone amiss either. 
  5. I imagine our sys-admin will be pleased to have some scratch space back too. 
]]>
https://samnicholls.net/2015/04/24/trembling/feed/ 0 261