What am I doing?

A week ago I had a progress meeting with Amanda and Wayne, who make up the supervisory team for the computational face of my project. I talked about how computers are terrible and where the project is heading.

As Wayne had been away from meetings for a few weeks, I began with a roundup of everything that has been going disastrously wrong1. Progress on a functional analysis of the limpet data has been repeatedly hindered by a lack of resources on our cluster, which is simply struggling with the sheer size of the jobs I’m asking of it.

The Cluster Conundrum

I’ve encountered two main issues with job size here:

  • Jobs that are large because the inputs are large but few (e.g. assembling raw reads contained in a pair of 42GB files), or
  • Jobs that are large because although the inputs are small (< 100MB), there are thousands of them (e.g. BLAST‘ing large numbers of contigs against a sharded database2)

Small-Big Jobs

The former is somewhat unavoidable. If velvet wants to consume 450GB of RAM for an assembly and we want an assembly specifically from velvet, then it’s a case of waiting patiently for one of the larger nodes to become free enough to schedule the job. Although we could look for other assemblers3 and evaluate their bold claims of reduced resource usage over competitors, often when we’ve found a tool that just works, we like to keep things that way, especially if we want to be able to compare results with other assemblies that must be manufactured in the same way.

Cluster jobs require resources to be requested up front, and guesstimating (even generously) can often lead to a job being terminated for exceeding its allowance, wasting queue time (days) as well as execution time (days or weeks) and leaving you with nothing to show4. The trouble with asking for too much is that you queue for a node longer, and once finally scheduled you effectively block others from using those resources for a significant period (and I’ll make you feel bad for it).
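
As a concrete example of what requesting up front looks like on a Sun Grid Engine cluster, a submission script might carry directives like the sketch below. The resource names (h_vmem, h_rt) and the parallel environment name are configured per-site by your administrators, and the velvet options are illustrative only, so treat this as a sketch rather than a recipe.

#!/bin/bash
# velvet_assembly.sge -- hypothetical SGE submission script
#$ -N velvet_assembly        # job name
#$ -cwd                      # run from the current working directory
#$ -pe multithread 8         # parallel environment and slot count (name is site-specific)
#$ -l h_vmem=64G             # memory request: exceed it and the scheduler kills the job
#$ -l h_rt=96:00:00          # hard wall-clock limit (4 days)

# The job itself; k-mer size and options are illustrative, reads assumed interleaved
velveth assembly_k31 31 -fastq -shortPaired reads_interleaved.fastq
velvetg assembly_k31 -exp_cov auto -cov_cutoff auto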

The only way to get around these constraints is to minimise the dataset you have in the first place. For example, for assemblies you could employ:

Normalization : Count appearances of substrings of length k (k-mers) present in the raw reads, then discard corresponding reads in a fashion that retains the distribution of k-mers. Discarding data is clearly lossy, but the idea is that the distribution of k-mers is still represented in the same way, just by fewer reads.

Partitioning : Attempt to construct a graph of all k-mers present in the raw reads, then partition it into a series of subgraphs based on connectivity. Corresponding reads from each partition can then be assembled separately and potentially merged afterwards. Personally I’ve found this method a bit hit and miss so far, but I’d like to find time to investigate it further.

Subsampling : Select a more manageable proportion of reads from your dataset at random and construct an assembly. Not only is this very lossy, it raises some interesting sampling bias issues of its own (to go with your original environmental sampling and PCR biases).

Iterative Subsampling : Assemble a subsample from your dataset and then align the contigs back to the original raw reads. Re-subsample from all remaining unaligned reads and create a second assembly, repeating the process until you have N different assemblies and are satisfied with the overall alignment (i.e. the set of remaining unaligned reads is sufficiently small). Tom in our lab group has been pioneering this approach and can hopefully give a better explanation of it than I can; a rough sketch of the loop follows this list.
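
To make the two subsampling ideas a little more concrete, here is a rough bash sketch of plain subsampling with seqtk and the skeleton of an iterative-subsampling loop using bwa and samtools to carry forward only the reads that failed to align to the previous round’s contigs. The tool choices, the 10% sampling fraction, the three rounds and the assemble_subsample placeholder are my own illustrative assumptions, not a description of Tom’s actual pipeline.

#!/bin/bash
set -euo pipefail

# --- Plain subsampling: keep ~10% of read pairs (the same seed keeps the pairs in sync) ---
seqtk sample -s42 reads_1.fastq 0.1 > sub_1.fastq
seqtk sample -s42 reads_2.fastq 0.1 > sub_2.fastq

# --- Iterative subsampling (skeleton), working on an interleaved copy for simplicity ---
seqtk mergepe reads_1.fastq reads_2.fastq > remaining.fastq

for i in 1 2 3; do
    # Subsample the reads that are still unaccounted for and assemble them
    seqtk sample -s"$i" remaining.fastq 0.1 > round${i}.fastq
    assemble_subsample round${i}.fastq round${i}_contigs.fa   # placeholder for your assembler of choice

    # Map everything left over back to this round's contigs...
    bwa index round${i}_contigs.fa
    bwa mem -p round${i}_contigs.fa remaining.fastq | samtools view -b -f 4 - > round${i}_unmapped.bam

    # ...and carry only the reads that did not map into the next round
    samtools fastq round${i}_unmapped.bam > remaining_next.fastq
    mv remaining_next.fastq remaining.fastq
done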

Big-Small Jobs

The latter category is a problem actually introduced by trying to optimise cluster scheduling in the first place. For example, an assembly can produce thousands of contigs (groups of reads believed by the assembler to belong together) and often we want to know whether any interesting known sequences can be found on these contigs. Databases of interesting known sequences are often (very) large, so to avoid submitting an inefficient, long-running, memory-hogging small-big job to locate thousands of different needles in thousands of different haystacks (i.e. BLAST‘ing many contigs against a large database), we can instead attempt to minimise the size of the job by amortising the work over many significantly smaller jobs.

For the purpose of BLAST5, we can shard both the contigs and the database of interesting sequences into smaller pieces. This reduces the search space (fewer interesting-sequence needles to find in fewer contig haystacks) and thus execution time and resource requirements. Now your monolithic job is represented by hundreds (or thousands) of smaller, less resource-intensive jobs that finish more quickly. Hooray!

Until the number of jobs you have starts causing trouble.

Of course, this in turn makes handling data for downstream analysis a little more complex: output files need converting, sorting and merging, and potentially re-sharding once again to fit them through a different tool.
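
For illustration, this sharding maps quite naturally onto an SGE array job, assuming the contigs have already been split into numbered chunks and the protein database into numbered shards. The chunk and shard counts, file names and BLAST+ parameters below are all invented for the sketch.

#!/bin/bash
# blast_shard.sge -- one array task per (contig chunk, database shard) pair
#$ -N blast_shards
#$ -cwd
#$ -t 1-1000                              # e.g. 100 contig chunks x 10 database shards

CHUNK=$(( (SGE_TASK_ID - 1) / 10 + 1 ))   # contig chunk index
SHARD=$(( (SGE_TASK_ID - 1) % 10 + 1 ))   # database shard index

blastx -query chunks/contigs_${CHUNK}.fa \
       -db shards/db_shard_${SHARD} \
       -outfmt 6 -evalue 1e-5 \
       -out hits/chunk${CHUNK}_shard${SHARD}.tsv

# Once every task has finished, "re-tail" the results back together:
#   cat hits/*.tsv > all_hits.tsv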

Conquering Complications

So how can we move forward? We could just do what is fashionable at the moment and write a fantastic new [assembler|aligner|pipeline] that is better and faster6 than the competition, uses next-to-no memory and can even be run on a Raspberry Pi, but this is more than a PhD in itself7, so sadly I guess we have to make do with what we have and attempt to use it more efficiently.

Digressing, I feel a major problem in bioinformatics software right now is a failure to adequately communicate the uses and effects of parameters: how can end-users of your software fine-tune8 controls and options without it feeling like piloting a Soyuz? I think if the learning curve is too great, with understanding hampered further by a lack of tutorials or extensive documentation with examples, users end up driven to roll their own solution. Often in these cases the end result is maintained by a single developer or group, missing out on the benefits of input from the open-source community at large.

Small-Big jobs can currently be tackled with novel methods like Tom’s iterative subsampling as described above, or of course, by adding additional resources (but that costs money).

Some of the risk recently identified with the execution of Big-Small jobs can be reduced by being a little more organised. I’m in the process of writing some software to ease interaction with Sun Grid Engine that now places logs generated during job execution outside of the working directory — reducing some of the I/O load when repeatedly requesting the contents of output directories.
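
The software itself is a topic for another post, but the underlying mechanism is simply Grid Engine’s own options for redirecting job output; something along these lines, with the log path being whatever suits your setup (the path here is illustrative):

# Send stdout and stderr logs to a directory outside the job's working directory,
# so repeatedly listing the output directory doesn't have to wade through them
qsub -cwd \
     -o /scratch/$USER/sge_logs/ \
     -e /scratch/$USER/sge_logs/ \
     my_job.sh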

Keeping abreast of the work of others who have dared to write their own new assembler, aligner or whatever is important too. Currently we’re testing out rapsearch as an alternative to BLAST, simply due to its execution speed (yet another post in itself). BLAST is pretty old and “better” alternatives are known to exist, but it’s still oft-cited and an expected part of analysis in journal papers, so switching out parts of our pipeline for performance is not ideal. At the same time, I actually want to get some work done, and right now using BLAST on the dataset I have, with the resources I have, is proving too problematic.

At the very least, we can now use rapsearch to quickly look for hits to be analysed further with BLAST if we fear that the community may be put off by our use of “non-standard” software.
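
That two-pass idea looks something like the sketch below. I am writing the RAPSearch2 and BLAST+ invocations from memory, so the exact flag names (particularly for prerapsearch and rapsearch) should be checked against the versions you have installed; the database and file names are placeholders.

# Build the protein database once for each tool
prerapsearch -d proteins.faa -n proteins_rapdb          # RAPSearch2 database
makeblastdb -in proteins.faa -dbtype prot -out proteins_blastdb

# Fast first pass with rapsearch to find candidate contigs...
rapsearch -q contigs.fa -d proteins_rapdb -o rap_hits -z 8

# ...then confirm just those candidates with "standard" BLAST for the paper trail
grep -v '^#' rap_hits.m8 | cut -f1 | sort -u > candidate_ids.txt
seqtk subseq contigs.fa candidate_ids.txt > candidates.fa
blastx -query candidates.fa -db proteins_blastdb -outfmt 6 -evalue 1e-5 -out confirmed_hits.tsv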

Ignoring the Impossible

After trading some graph theory with Wayne in return for some biological terminology, we turned our attention to a broad view of where the project as a whole is heading. We discussed how difficult it is to assemble entire genomes from metagenomic datasets due to environmental bias, PCR bias and, clearly, computational troubles.

I’d described my project at a talk previously:

[…] it’s like trying to simultaneously assemble thousands of jigsaws but some of the jigsaws are heavily duplicated and some of the jigsaws hardly appear at all, a lot of the pieces are missing and quite a few pieces that really should fit together are broken. Also the jigsaws are pictures of sky.

Lately I’ve started to wonder how this is even possible: how can we state with confidence that we’ve assembled a whole environment? How do we know the initial sample contained all the species? How can we determine what is sequencing error and what is real and rare? How on Earth are we supposed to identify all affinities in variation for all species across millions of barely-overlapping reads that are shorter than my average Tweet?

We can’t9.

But that’s ok. That isn’t the project. These sorts of aims are too broad, though that won’t prevent me from trying. Currently I’m hunting for hydrolases (enzymes used to break apart chemical bonds in the presence of water), so we can turn the problem on its head a little. Instead of creating an assembly, assigning taxonomic and functional annotations to every single one of the resulting contigs and then filtering the results by those that resemble hydrolase-like behaviour (treating each contig as equally interesting), we can just look for contigs that contain coding regions for hydrolases directly! We can use a similarity search tool such as rapsearch or BLAST to look for needles from a hydrolase-specific database of our own construction, instead of a larger, more general bacterial database.
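
As a hedged sketch of that turn-it-on-its-head step: build a small protein database from curated hydrolase sequences only (however you choose to collect them) and search the contigs against that, rather than annotating everything against a general database. The file names and thresholds below are placeholders.

# Curated hydrolase protein sequences -> small BLAST database
makeblastdb -in hydrolases.faa -dbtype prot -out hydrolase_db

# Translated search of the assembled contigs for hydrolase-like coding regions
blastx -query contigs.fa -db hydrolase_db \
       -outfmt 6 -evalue 1e-10 -max_target_seqs 5 \
       -out hydrolase_hits.tsv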

We can then query the assembly for the original raw reads that built the contig on which strong hits for hydrolases appear. We can take a closer look at these reads alone, filtering out whole swathes of the assembly (and thus millions of reads) that are “uninteresting” in terms of our search.
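
Recovering the original raw reads behind a contig can be done by mapping the reads back onto the assembly and then extracting only those that land on the contigs of interest; a minimal sketch with bwa and samtools, assuming the hydrolase_hits.tsv table from the previous sketch (file and contig names are invented):

# Map the raw reads back onto the assembly
bwa index contigs.fa
bwa mem contigs.fa reads_1.fastq reads_2.fastq | samtools sort -o mapped.bam -
samtools index mapped.bam

# Keep only the reads that landed on contigs with strong hydrolase hits
cut -f1 hydrolase_hits.tsv | sort -u > interesting_contigs.txt
samtools view -b mapped.bam $(cat interesting_contigs.txt) > interesting.bam
samtools fastq interesting.bam > interesting_reads.fastq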

We want to identify and extract interesting enzymes and the sequences that encode them; discovering a novel species in the process would be a nice bonus, but the protein sequence is the key.


tl;dr

  • My data is too big and my computer is too small.
  • There are big-small jobs and small-big jobs and both are problematic and unavoidable.
  • There just isn’t time to look at everything that is interesting.
  • We need to know the tools we are using inside out and have a very good reason to make our own.
  • We don’t have to care about data that we aren’t interested in.
  • The project probably isn’t impossible.

  1. Which is pretty much anything that involves a computer. 
  2. In an attempt to speed up BLAST queries against large databases we have taken to splitting the database into ‘shards’, submitting a job for each set of contigs against a specific database shard, before cat‘ing all the results together at the end. I call this re-tailing.
  3. In fact, I’m currently trying to evaluate MegaHIT.
  4. This isn’t always strictly true. For example, aligners can flush output hits to a file as they go along, and with a bit of fiddling you can pick up where you left off and cat the outputs together5.
  5. Other short-read sequencer aligners are available. 
  6. Bonus points for ensuring it is also harder and stronger. 
  7. I learned from my undergraduate dissertation that no matter how hard you try, the time to investigate every interesting side-street simply does not exist and it’s important to try and stay on some form of track. 
  8. I had a brief discussion about the difficulty of automated parameter selection on Twitter after a virtual conference and this is something I’d like to write more about at length in future.

  9. Probably. 
The Story so Far: Part I, A Toy Dataset

In this somewhat long and long overdue post, I’ll attempt to explain the work done so far, give an overview of the many issues encountered along the way and offer an insight into why doing science is much harder than it ought to be.

This post got a little longer than anticipated, so I’ve sharded it like everything else around here.


In the beginning…

To address my lack of experience in handling metagenomic data, I was given a small1 dataset to play with. Immediately I had to adjust my interpretation of what constitutes a “small” file. Previously the largest single input I’d had to work with would probably have been the human reference genome (GRCh37), which as a FASTA2 file clocks in at a little over 3GB3.

Thus imagine my dismay when I am directed to the directory of my input data and find a pair of 42GB files.
Together, the files are 28x the size of the largest file I’ve ever worked with…

So, what is it?

Size aside, what are we even looking at and how is there so much of it?

The files represent approximately 195 million read pairs from a nextgen4 sequencing run, with each file holding one half of each pair in the FASTQ format. The dataset is from a previous IBERS PhD student and was introduced in a 2014 paper titled Metaphylogenomic and potential functionality of the limpet Patella pellucida’s gastrointestinal tract microbiome [PubMed]. According to the paper, over 100 Blue-rayed Limpets (Patella pellucida) were collected from the shore of Aberystwyth and placed into tanks to graze on Oarweed (Laminaria digitata) for one month. 60 were plated, anesthetized and aseptically dissected; the extracted digestive tracts were vortexed and homogenized before repeated filtering and a final centrifugation to concentrate cells as a pellet. The pellets were then resuspended and DNA was extracted with a soil kit to create an Illumina paired-end library.

The paper describes the post-sequencing data handling briefly: a net total of 398 million reads were quality processed using fastq-mcf to remove adaptor sequences, reads with quality lower than Q20 and reads shorter than 31bp. The first 15bp of each read were also truncated5. It was noted that the remaining 391 million reads were heavily contaminated with host-derived sequences and thus insufficient for functional analysis.

My job was to investigate the extent of the contamination and whether any non-limpet reads could be salvaged for functional analysis.

Let’s take a closer look at the format to see what we’re dealing with.

FASTQ Format

FASTQ is another text-based file format, similar to FASTA, but it also stores a quality score for each nucleotide in a sequence. Headers are demarcated by the @ character instead of > and, although not required to be, tend to be formatted strings containing information pertaining to the sequencing device that produced the read. Sequence data is followed by a single + on a new line, before a string of quality scores (encoded as ASCII characters within a certain range, depending on the quality schema used) follows on another new line:

@BEEPBOOP-SEQUENCER:RUN:CELL:LANE:TILE:X:Y 1:FILTERED:CONTROL:INDEX
HELLOIAMASEQUENCEMAKEMEINTOAPROTEINPLEASES
+
!"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJ

This example uses Illumina 1.8+ sequence identifiers and quality scores, the same as those found in the dataset. The quality string represents increasing quality from 0 (worst) to 41 (best) left to right. Taking the first read from the first file as an actual example, we get a better look at a “real-world” sequence header:

@HWI-D00173:21:D22FBACXX:5:1101:1806:1986 1:N:0:CTCTCT
TTGTGTCAAAACCGAACAACATGACAATCTTACTTGCCTGGCCCTCCGTCCTGCACTTCTGGCATGGGGAAACCACACTGGGGGC
+
IIIAEGIIIFIIIEGIFFIIIFIFIIEFIIIIEFIIEFGCDEFFFFABDDCCCCCBBBBBBBBBBBBBB?BBBB@B?BBBBBBB5

So how are these files so large6? Given each read record takes four lines (assuming reads short enough not to be spread over multiple lines, which is the case here) and each file contains around 195 million reads, we’re looking at 780 million lines. Per file.

The maximum sequence size was 86bp and each base takes one byte to store, as well as corresponding per-base quality information:

86 * 2 * 195000000
> 33540000000B == 33.54GB

Allowing some arbitrary number of bytes for headers and the + characters:

((86 * 2) + 50 + 1) * 195000000
> 43485000000B == 43.49GB

It just adds up. To be exact, both input files span 781,860,356 lines each, meaning around 781MB of storage is used for the newlines alone! It takes wc 90 seconds just to count the lines; these files aren’t small at all!

Quality Control

Although already done (as described by the paper), it’s good to get an idea of how to run and interpret basic quality checks on the input data. I used FASTQC which outputs some nice HTML reports that can be archived somewhere if you are nice and organised.
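
Running it is a one-liner; something like the command below, with the output directory being wherever you keep yourself organised (the path is illustrative):

mkdir -p qc_reports
fastqc A3limpetMetaCleaned_1.fastq.trim A3limpetMetaCleaned_2.fastq.trim --outdir qc_reports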

Top Tip
It’s good to be nice and organised because in writing this blog post I’ve been able to quickly retrieve the FASTQC reports from October and realised I missed a glaring problem as well as a metric that could have saved me from wasting time.

Shit happens. Well, it’s good you’re checking, I’m much less organised.

— Francesco

For an input FASTQ file, FASTQC generates a summary metrics table. I’ve joined the two tables generated for my datasets below.

Measure              Value (_1)                        Value (_2)
Filename             A3limpetMetaCleaned_1.fastq.trim  A3limpetMetaCleaned_2.fastq.trim
File type            Conventional base calls           Conventional base calls
Encoding             Sanger / Illumina 1.9             Sanger / Illumina 1.9
Total Sequences      195465089                         195465089
Filtered Sequences   0                                 0
Sequence length      4-86                              16-86
%GC                  37                                37

Here I managed to miss two things:

  • Both files store the same number of sequences (which is expected, as the sequences are paired), something that I apparently forgot about shortly afterwards…
  • Neither file contains sequences of uniform length, nor do the two files share the same length range, meaning that some pairs will not align correctly as they cannot overlap fully…

FASTQC also generates some nice graphs, of primary interest, per-base sequence quality over the length of a read:

[Per-base quality box plots from FASTQC for A3limpetMetaCleaned_1.fastq.trim and A3limpetMetaCleaned_2.fastq.trim]

Although made small to fit, both box plots clearly demonstrate that average base quality (blue line) lives well within the “green zone” (binning scores of 28+) slowly declining to a low of around Q34. This is a decent result, although not surprising considering quality filtering has already been performed on the dataset to remove poor quality reads! A nice sanity check nonetheless. I should add that it is both normal and expected for average per-base quality to fall over the length of a read (though this can be problematic if the quality falls drastically) by virtue of the unstable chemistry involved in sequencing.

FASTQC can plot the distribution of GC content against a hypothetical normal distribution; this is useful for genomic sequencing, where one would expect such a distribution. However, a metagenomic sample will (should) contain many species that may have differing distributions of GC content across their individual genomes. FASTQC will often raise a warning about the distribution of GC content for such metagenomic samples, given a statistically significant deviation from or violation of the theoretical normal. These can be ignored.

Two other tests also typically attract warnings or errors: K-mer content and sequence duplication levels. These tests attempt to quantify the diversity of the reads at hand, which is great when you are looking for diversity within reads of one genome, raising a flag if perhaps you’ve accidentally sequenced all your barcodes or done too many rounds of PCR and been left with a lot of similar looking fragments. But once again, metagenomics violates expectations and assumptions made by traditional single-species genomics. For example, highly represented species ([and|or] fragments that do well under PCR) will dominate samples and trigger apparently high levels of sequence duplication, especially if many species share many similar sequences, which is likely in environments undergoing co-evolution.

FASTQC also plots N count (no call), GC ratio and average quality scores across whole reads as well as per-base sequence content (which should be checked for a roughly linear stability) and distribution of sequence lengths (which should be checked to ensure the majority of sequences are a reasonable size). Together a quick look at all these metrics should provide a decent health check before moving forward, but they too should be taken with a pinch of salt as FASTQC tries to help you answer the question “are these probably from one genome?”.

Trimming

Trimming is the process of truncating bases that fall below a particular quality threshold at the start and end of reads. Typically this is done to create “better” read overlaps (as potential low-quality base calls could otherwise prevent overlaps that should exist) which can help improve performance of downstream tools such as assemblers and aligners.
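
For reference, the sort of command that produces a dataset like this one (the paper used fastq-mcf) looks roughly like the snippet below. I’m reconstructing the flags from memory of ea-utils’ fastq-mcf, so check them against your installed version; adapters.fa is a placeholder for the adapter file.

# Quality/adapter trim a pair of FASTQ files with fastq-mcf (ea-utils):
#   -q 20  trim bases below Q20
#   -l 31  discard reads shorter than 31bp after trimming
fastq-mcf adapters.fa raw_1.fastq raw_2.fastq \
    -q 20 -l 31 \
    -o trimmed_1.fastq -o trimmed_2.fastq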

Trimming had already been completed by the time I had got hold of the dataset7, but I wanted to perform a quick sanity check and ensure that the two files had been handled correctly8. Blatantly forgetting about and ignoring the FASTQC reports I had generated and checked over, I queried both files with grep:

grep -c '^@' $FILE

196517718   A3limpetMetaCleaned_1.fastq.trim
196795722   A3limpetMetaCleaned_2.fastq.trim

“Oh dear”9, I thought. The numbers of sequences in the two files are not equal. “Somebody bumbled the trimming!”. I hypothesised that low-quality sequences had been removed from one file without their corresponding mates being removed from the other, throwing the pairs out of sync.

With hindsight, let’s take another look at the valid range of quality score characters for the Illumina 1.8+ format:

!"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJ
^                              ^         ^
0..............................31........41

The record-delimiting character, @, is used to encode Q31. For some reason somebody thought it would be a great idea to make this character available for use in quality strings. I’d been counting records as well as any quality line that happened to begin with an encoded score of Q31: the @.
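
A safer way to count FASTQ records is to rely on the fact that each record is exactly four lines (true here, as no sequences are wrapped): divide the line count by four, or only look at every fourth line.

# Count records by dividing the line count by four...
echo $(( $(wc -l < A3limpetMetaCleaned_1.fastq.trim) / 4 ))

# ...or by counting only header lines (every fourth line, starting at the first)
awk 'NR % 4 == 1' A3limpetMetaCleaned_1.fastq.trim | wc -l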


And so, I unnecessarily launched myself headfirst into my first large-scale computing problem: given two sets of ~196 million reads which mostly overlap, how can we efficiently find the intersection (and write it to disk)?
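
(For a flavour of one way this might be approached with standard tools: extract the read identifiers from both files, sort them, take the common set with comm and pull the corresponding records back out with seqtk subseq. This is only a hedged sketch; whether it is what I actually ended up doing is a story for the next part.)

# Read IDs are the first whitespace-delimited token of each header line, minus the @
awk 'NR % 4 == 1 {print substr($1, 2)}' A3limpetMetaCleaned_1.fastq.trim | sort > ids_1.txt
awk 'NR % 4 == 1 {print substr($1, 2)}' A3limpetMetaCleaned_2.fastq.trim | sort > ids_2.txt

# IDs present in both files
comm -12 ids_1.txt ids_2.txt > shared_ids.txt

# Pull the shared records back out of each file; both keep their original order, so pairs stay in sync
seqtk subseq A3limpetMetaCleaned_1.fastq.trim shared_ids.txt > paired_1.fastq
seqtk subseq A3limpetMetaCleaned_2.fastq.trim shared_ids.txt > paired_2.fastq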


tl;dr

  • In bioinformatics, “small” really isn’t small.
  • Try to actually read quality reports, then read them again. Then grab a coffee or go outside and come back and read them for a third time before you do something foolish and have to tell everyone at your lab meeting about how silly you are.
  • Don’t count the number of sequences in a FASTQ file by counting @ characters at the start of a line; it’s a valid quality score in most quality encoding formats.

  1. Now realised to be a complete misnomer, both in terms of size and effort. 
  2. A text-based file format where sequences are delimited by > and a sequence name [and|or] description, followed by any number of lines containing nucleotides or amino acids (or in reality, whatever you fancy):

    >Example Sequence Hoot Factor 9 | 00000001
    HELLOIAMASEQUENCE
    BEEPBOOPTRANSLATE
    MEINTOPROTEINS
    >Example Sequence Hoot Factor 9 | 00000002
    NNNNNNNNNNNNNNNNN

    Typically sequence lines are of uniform length (under 80bp), though this is not a requirement of the format. The NCBI suggests formats for the header (a single-line descriptor following the > character), though these are also not a requirement of the format. 

  3. Stored as text, we take a byte for each of the 3 billion nucleotides, as well as each newline delimiter and an arbitrary number of bytes for each chromosome’s single-line header. 
  4. Seriously, can we stop calling it nextgen yet? 
  5. I’m unsure why; from a recent internal talk I was under the impression that we’d normally trim the first “few” bases (1-3bp, maybe up to 13bp if there are a lot of poor quality nucleotides) to try to improve downstream analysis such as alignments (given the starts and ends of reads can often be quite poor and not align as well as they should), but 15bp seems excessive. It also appears the ends of the reads were not truncated.

    Update
    It’s possible an older library prep kit was used to create the sequencing library, thus the barcodes would have needed truncating from the reads along with any poor scoring bases.

    The ends of the reads were not truncated as the quality falls inside a reasonable threshold.
     

  6. Or small, depending on whether you’ve adjusted your world view yet. 
  7. See footnote 5. 

  8. Primarily because I don’t trust anyone. Including myself. 
  9. I’m sure you can imagine what I really said. 