quality control – Samposium
https://samnicholls.net

The Tolls of Bridge Building: Part IV, Mysterious Malformations
https://samnicholls.net/2015/08/27/bridgebuilding-p4/
Thu, 27 Aug 2015

Following a short hiatus on the sample un-improvement job (which may or may not have been halted by vr-pipe inadvertently knocking over a storage node at the Sanger Institute), our 837 non-33 jobs burst back into life, only to fall at the final hurdle of the first pipeline of the vr-pipe workflow. Despite my lack of deerstalker and pipe, it was time to play bioinformatics Sherlock Holmes.

Without mercury access to review reams of logs myself, the vr-pipe web interface provides a single sample of output from a random job in the state of interest. Taking its offering, it seemed I was about to get extremely familiar with lanelet 7293_3#8. The error was at least straightforward…

Accounting Irregularities1

... [step] failed because 48619252 reads were generated in the output bam
file, yet there were 48904972 reads in the original bam file at [...]
/VRPipe/Steps/bam_realignment_around_known_indels.pm line 167

The first thing to note is that the error is raised in vr-pipe itself, not in the software used to perform the step in question — which happened to be GATK, for those interested. vr-pipe is open source software, hosted on Github. The deployed release of vr-pipe used by the team is a fork and so the source raising the error is available to anyone with the stomach to read Perl:

my $expected_reads = $in_file->metadata->{reads} || $in_file->num_records;
my $actual_reads = $out_file->num_records;

if ($actual_reads == $expected_reads) {
    return 1;
}
else {
    $out_file->unlink;
    $self->throw("cmd [$cmd_line] failed because $actual_reads reads were generated in the output bam file, yet there were $expected_reads reads in the original bam file");
}

The code is simple, vr-pipe just enforces a check that all the reads from the input file make it through to the output file. Seems sensible. This is very much desired behaviour as we never drop reads from BAM files, there’s enough flags and scores in the SAM spec to put reads aside for one reason or another without actually removing them. So we know that the job proper completed successfully (there was an output file sans errors from GATK itself). So the question now is, where did those reads go?

Though before launching a full-scale investigation, my first instinct was to question the error itself and check that the numbers stated in the message were even correct. It was easy to confirm the original number of reads, 48,904,972, by checking the vr-pipe metadata I generated to submit the bridged BAMs to vr-pipe. I ran samtools view -c on the bridged BAM again to be sure.

But vr-pipe‘s standard over-reaction to job failure is nuking all the output files, so I’d need to run the step myself manually on lanelet 7293_3#8. A few hours later, samtools view -c and samtools stats confirmed the output file really did contain 48,619,252 reads, a shortfall of 285,720.

I asked Irina, one of our vr-pipe sorceresses with mercury access, to dig out the whole log for our spotlight lanelet. Two stub INFO lines sitting right at the tail of the GATK output shed immediate light on the situation…

Expunged Entries

INFO 18:31:28,970 MicroScheduler – 285720 reads were filtered out during the traversal out of approximately 48904972 total reads (0.58%)

INFO 18:31:28,971 MicroScheduler – -> 285720 reads (0.58% of total) failing MalformedReadFilter

Every single one of the 285,720 “missing” reads can be accounted for by this MalformedReadFilter. This definitely isn’t expected behaviour; as I said already, we have ways of marking reads as QC failed, supplementary or unmapped without just discarding them wholesale. Our pipeline is not supposed to drop data. The MalformedReadFilter documentation on the GATK website states:

Note that the MalformedRead filter itself does not need to be specified in the command line because it is set automatically.

This at least explains the “unexpected” nature of the filter, but surely vr-pipe encounters files with bad reads that need filtering all the time? My project can’t be the first… I figured I must have missed a pre-processing step, I asked around: “Was I supposed to do my own filtering?”. I compared the @PG lines documenting programs applied to the 33 to see whether they had undergone different treatment, but I couldn’t see anything related to filtering, quality or otherwise.

I escalated the problem to Martin who replied with a spy codephrase:

the MalformedReadFilter is a red herring

Both Martin and Irina had seen similar errors that were usually indicative of the wrong version of vr-pipe [and|or] samtools, causing a subset of flagged reads to be counted rather than all of them. But I explained that the versions were correct and that I’d manually confirmed the reads as actually missing by running the command myself; we were all a little stumped.

I read the manual for the filter once more and realised we’d ignored the gravity of the word “malformed”:

This filter is applied automatically by all GATK tools in order to protect them from crashing on reads that are grossly malformed.

We’re not talking about bad quality reads, we’re talking about reads that are incorrect in such a way that it may cause an analysis to terminate early. My heart sank, I had a hunch. I ran the output from lanelet 7293_3#8‘s bridgebuilder adventure through GATK PrintReads, an arbitrary tool that I knew also applied the MalformedReadFilter. Those very same INFO lines were printed.

My hunch was right, the reads had been malformed all along.

Botched Bugfix

I had a pretty good idea as to what had happened, but I ran the inputs to brunel (the final step of bridgebuilder) through PrintReads as a sanity check. This proved a little more difficult to orchestrate than one might have expected; I had to falsify headers and add those pesky missing RG tags that plagued us before.

The inputs were fine, as suspected, brunel was the malformer. My hunch? My quick hack to fix my last brunel problem had come back to bite me in the ass and caused an even more subtle brunel problem.

Indeed, despite stringently checking the bridged BAMs with five different tools, successfully generating an index and even processing the file with Picard to mark duplicate reads and GATK to detect and re-align around indels, these malformed reads still flew under the radar — only to be caught by a few lines of Perl that a minor patch to vr-pipe happened to put in the way.

Recall that my brunel fix initialises the translation array with -1:

Initialise trans[i] = -1
The only quick-fix grade solution that works, causes any read on a TID that has no translation to be regarded as “unmapped”. Its TID will be set to “*” and the read is placed at the end of the result file. The output file is however, valid and indexable.

This avoided an awful bug where brunel would assign reads to chromosome 1 if their actual chromosome did not have a translation listed in the user-input translation file. In practice, the fix worked. Reads appeared “unmapped”, their TID was an asterisk and they were listed at the end of the file. The output was viewable, indexable and usable with Picard and GATK, but technically not valid after all.

To explain why, let’s feed everybody’s favourite bridged BAM 7293_3#8 to Picard ValidateSamFile, a handy tool that does what it says on the tin. ValidateSamFile linearly passes over each record of a BAM and prints any associated validation warnings or errors to a log file2. As the tool moved over the target, an occasional warning that could safely be ignored was written to my terminal3. The file seemed valid and, as the progress bar indicated we were nearly out of chromosomes, I braced.

As ValidateSamFile attempted to validate the unmapped reads, a firehose of errors (I was expecting a lawn sprinkler) spewed up the terminal, and despite my best efforts I couldn’t Ctrl-C quickly enough to point it away from my face.

False Friends

There was a clear pattern to the log file. Each read that I had translated to unmapped triggered four errors. Taking just one of thousands of such quads:

ERROR: […] Mate Alignment start should be 0 because reference name = *.
ERROR: […] Mapped mate should have mate reference name
ERROR: […] Mapped read should have valid reference name
ERROR: […] Alignment start should be 0 because reference name =*.

Now, a bitfield of flags on each read is just one of the methods employed by the BAM format to perform validation and filtering operations quickly and efficiently. One of these bits: 0x4, indicates a read is unmapped. Herein lies the problem, although I had instructed brunel to translate the TID (which refers to the i-th @SQ line in the header) to -1 (i.e. no sequence), I did not set the unmapped flag. This is invalid (Mapped read should have valid reference name), as the read will appear as aligned, but to an unknown reference sequence.

Sigh. A quick hack combined with my ignorance of the underlying BAM format specification was at fault4.

The fix would be frustratingly easy, a one-liner in brunel to raise the UNMAPPED flag (and to be good, a one-liner to unset the PROPER_PAIR flag5) for the appropriate reads. Of course, expecting an easy fix jinxed the situation and the MalformedReadFilter cull persisted, despite my new semaphore knowledge.
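To make the flag arithmetic concrete, here is a minimal Python sketch of that one-liner. The flag constants come from the SAM specification; the helper function is hypothetical, not brunel’s actual C code:

```python
# SAM flag bits, as defined in the SAM specification
UNMAPPED = 0x4      # read is unmapped
PROPER_PAIR = 0x2   # read is mapped in a proper pair

def mark_unmapped(flag: int) -> int:
    """Raise the unmapped bit and clear the proper-pair bit on a SAM flag."""
    flag |= UNMAPPED       # set 0x4: the read is now officially unmapped
    flag &= ~PROPER_PAIR   # unset 0x2: an unmapped read cannot be properly paired
    return flag

# A read flagged as paired (0x1) and properly aligned (0x2):
print(hex(mark_unmapped(0x1 | 0x2)))  # 0x5 (paired + unmapped)
```

The same two bitwise operations, expressed in C against htslib’s `bam1_t` flag field, are all the brunel patch needed.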

For each offending read, ValidateSamFile produced a triplet of errors:

ERROR: […] Mate Alignment start should be 0 because reference name = *.
ERROR: […] MAPQ should be 0 for unmapped read.
ERROR: […] Alignment start should be 0 because reference name =*.

The errors at least seem to indicate that I’d set the UNMAPPED flag correctly. Confusingly, the format spec has the following point (parentheses and emphasis mine):

Bit 0x4 (unmapped) is the only reliable place to tell whether the read is unmapped. If 0x4 is set, no assumptions can be made about RNAME, POS, CIGAR, MAPQ, and bits 0x2 (properly aligned), 0x100 (secondary), and 0x800 (supplemental).

It would seem that canonically, the alignment start (POS) and mapping quality (MAPQ) are untrustworthy on reads where the unmapped flag is set. Yet this triplet appears exactly the same number of times as the number of reads hard filtered by the MalformedReadFilter.

I don’t understand how these reads could be regarded as “grossly malformed” if even the format specification acknowledges the possibility of these fields containing misleading information. Invalid yes, but grossly malformed? No. I just have to assume there’s a reason GATK is being especially anal about such reads, perhaps developers simply (and rather fairly) don’t want to deal with the scenario of not knowing what to do with reads where the values of half the fields can’t be trusted6. I updated brunel to set the alignment positions for the read and its mate alignment to position 0 in retaliation.

I’d reduced the ValidateSamFile error quads to triplets and now, the triplets to duos:

ERROR: […] Mate Alignment start should be 0 because reference name = *.
ERROR: […] Alignment start should be 0 because reference name =*.

The alignment positions are non-zero? But I just said I’d fixed that? What gives?

I used samtools view and grabbed the tail of the BAM, those positions were indeed non-zero. I giggled at the off-by-one error, immediately knowing what I’d done wrong.

The BAM spec describes the alignment position field as:

POS: 1-based leftmost mapping POSition of the first matching base.

But under the hood, the pos field is defined in htslib as a 0-based co-ordinate, because that’s how computers work. The 0-based indices are then converted by just adding 1 whenever necessary. Thus to obtain a value of 0, I’d need to set the positions to -1. This off-by-one mismatch is a constant source of embarrassing mistakes in bioinformatics, where typically 1-based genomic indices must be reconciled with 0-indexed data structures7.
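The mismatch is tiny but treacherous. A sketch of the conversion (`pos_to_sam` is a hypothetical helper; htslib performs this addition internally when rendering SAM text):

```python
def pos_to_sam(pos_0based: int) -> int:
    """Convert htslib's internal 0-based position to the 1-based SAM POS field."""
    return pos_0based + 1

# To make SAM report POS = 0 ("no position"), the internal field must be -1:
print(pos_to_sam(-1))  # 0
print(pos_to_sam(0))   # 1 -- setting the internal field to 0 yields POS 1, not 0
```

Setting the internal field to 0, as I first did, therefore produced reads claiming to start at position 1, which is exactly what ValidateSamFile kept complaining about.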

Updating brunel once more, I ran the orchestrating Makefile to generate what I really hoped to be the final set of bridged BAMs, ran them through Martin’s addreplacerg subcommand to fill in those missing RG tags, and then each went up against the now-six final check tools (yes, they validated with ValidateSamFile). I checked index generation, re-ran a handful of sanity checks (mainly ensuring we hadn’t lost any reads) and re-generated the manifest file, before finally asking Irina, for what I really hope is the last time, to reset my vr-pipe setup.

I should add that Martin suggested I use samtools fixmate, a subcommand designed for this very purpose. The problem is, for fixmate to know where the mates are, they must be adjacent to each other; that is, sorted by name and not by co-ordinate. It was thus cheaper, both in time and computation, to just re-run brunel than to name-sort, fixmate and coordinate-sort the current bridged BAMs.

Victorious vr-pipe Update: Hours later

Refreshing the vr-pipe interface intermittently, I was waiting to see a number of inputs larger than 33 make it to the end of the first pipeline of the workflow: represented by a friendly looking green progress bar.

Although the number currently stands at 1, I remembered that all of the 33 had the same leading digit and that for each progress bar, vr-pipe will offer up metadata on one sample job. I hopefully clicked the green progress bar and inspected the metadata, the input lanelet was not in the set of the 33 already re-mapped lanelets.

I’d done it. After almost a year, I’ve led my lanelet Lemmings to the end of the first level.


tl;dr

  • GATK has an anal, automated and aggressive MalformedReadFilter
  • Picard ValidateSamFile is a useful sanity check for SAM and BAM files
  • Your cleverly simple bug fix has probably swapped a critical and obvious problem for a critically unnoticeable one
  • Software is not obliged to adhere to any or all of a format specification
  • Off by one errors continue to produce laughably simple mistakes
  • I should probably learn the BAM format spec inside out
  • Perl is still pretty grim

  1. Those reads were just resting in my account! ↩
  2. Had I known of the tool sooner, I would have employed it as part of the extensive bridgebuilder quality control suite. ↩
  3. Interestingly, despite causing jobs to terminate, a read missing an RG tag is a warning, not an error. ↩
  4. Which only goes to further my don’t trust anyone, even yourself mantra. ↩
  5. Although not strictly necessary as per the specification (see below), I figured it was cleaner. ↩
  6. Though, looking at the documentation, I’m not even sure what aspect of the read is triggering the hard filter. The three additional command line options described don’t seem to be related to any of the errors raised by ValidateSamFile and there is no explicit description of what is considered to be “malformed”. ↩
  7. When I originally authored Goldilocks, I tried to be clever and make things easier for myself, electing to use a 1-based index strategy throughout. This was partially inspired by FORTRAN, which features 1-indexed arrays. In the end, the strategy caused more problems than it solved and I had to tuck my tail between my legs and return to a 0-based index. ↩
The Tolls of Bridge Building: Part III, Sample (Un)Improvement
https://samnicholls.net/2015/07/31/bridgebuilding-p3/
Fri, 31 Jul 2015

Previously, on Samposium: I finally had the 870 lanelets required for the sample improvement process. But in this post, I explain how my deep-seated paranoia about the quality of my data just wasn’t enough to prevent what happened next.

I submitted my 870 bridged BAMs to vr-pipe, happy to essentially be rid of having to deal with the data for a while. vr-pipe is a complex bioinformatics pipeline that in fact consists of a series of pipelines, each performing some task or another, with many steps required for each. The end result is “improved sample” BAMs, though perhaps due to the nature of our inclusion of failed lanelets we should title them “unimproved sample” BAMs. Having just defeated the supposedly automated bridging process, I was pretty happy not to have to be doing this stuff manually and could finally get on with something else for a while… Or so I thought.

A Quiet Weekend

It was Friday and I was determined to leave on-time to see my family in Swansea at a reasonable hour1. After knocking up a quick Python script to automatically generate the metadata file vr-pipe requires, describing the lanelets to be submitted and what sample they should be included in the “improvement” of, I turfed the job over to Micheal who has fabled mercury access2 and could set-up vr-pipe for me.

Being the sad soul that I am, I occasionally checked in on my job over the weekend via the vr-pipe web interface, only to be confused by the apparent lack of progress. The pipeline appeared to just be juggling jobs around various states of pending. But without mercury access to inspect more, I was left to merely enjoy my weekend.

Between trouble on the server farm and the fact that my job is not particularly high priority, the team suggested I be more patient and so I gave it a few more days before pestering Josh to take a look.

Pausing for Permissions

As suspected, something had indeed gone wrong. Instead of telling anybody, vr-pipe sat on a small mountain of errors, hoping the problem would just go away. I’ve come to understand this is expected behaviour. Delving deep in to the logs revealed the simple problem: vr-pipe did not have sufficient write permissions to the directory I had provided the lanelet files in, because I didn’t provide group-write access to it.

One chmod 775 . later and the pipeline burst in to life, albeit very briefly, before painting the vr-pipe web interface bright red. Evidently, the problem was more serious.

Sorting Names and Numbers

The first proper step for vr-pipe is creating an index of each input file with samtools index. Probably the most important thing to note for an index file, is that to create one, your file must be correctly sorted. Micheal checked the logs for me and found that the indexing job had failed on all but 33 (shocker) of the input files, complaining that they were all unsorted.

But how could this be? brunel requires sorted input to work, our orchestrating Makefile takes care of sorting files as and when needed with samtools sort. There must be some mistake!

I manually invoked samtools index on a handful of my previous bridged BAMs and indeed, they are unsorted. I traced back through the various intermediate files to see when the sort was damaged before finally referring to the Makefile. My heart sank:

[...]
# sort by queryname
%.queryname_sort.bam: LSF_MEM=10000
%.queryname_sort.bam: LSF_CPU=4
%.queryname_sort.bam: %.bam
${SAMTOOLS_BIN} sort -@ ${LSF_CPU} -n -T ${TMP_DIR}$(word 1,$+) -O bam -o $@ $<

# sort by coordinate position
%.coordinate_sort.bam: LSF_MEM=10000
%.coordinate_sort.bam: LSF_CPU=4
%.coordinate_sort.bam: %.bam
${SAMTOOLS_BIN} sort -@ ${LSF_CPU} -n -T ${TMP_DIR}$(word 1,$+) -O bam -o $@ $<
[...]

samtools sort accepts an -n flag, to sort by query name, rather than the default co-ordinate position (i.e. chromosome and base position on the reference sequence). Yet somehow the Makefile had been configured to use query name sorting for both. I knew I was the last one to edit this part of the file, as I’d altered the -T argument to prevent the temporary file clobbering discovered in the last episode.

Had I been lazy and naughty and copied the line from one stanza to the next? I was sure I hadn’t. But the knowing grin Josh shot at me when I showed him the file had me determined to try and prove my innocence. Although, it should first be noted that a spot in a special part of hell must be reserved for the both of us, as the Makefile was not under version control.

I’d squirrelled away many of the original files from 2014, including the intermediates and selecting any co-ordinate sorted file yielded a set of query name sorted reads. My name was cleared! Of course, whilst this was apparently Josh’s mistake, it’s not entirely fair to point the finger given I never noticed the bug despite spending more time nosing around the Makefile than anyone else. As mentioned, I’d even obliviously edited right next to the extraneous -n in question.

But I am compelled to point the finger elsewhere: brunel requires sorted inputs to work correctly, else it creates files that clearly can’t be used in the sample (un)improvement pipeline! How was this allowed to happen?

Sadly, it’s a matter of design. brunel never anticipated that somebody might attempt to provide incorrect input and just gets on with its job regardless. Frustratingly, had this been a feature of brunel, we’d have caught this problem a year ago on the first run of the pipeline.
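As a sketch of the defensive check brunel lacked, here is some hypothetical Python that verifies a stream of (tid, pos) pairs really is coordinate-sorted before proceeding; a real tool would pull tid and pos from each BAM record rather than from tuples:

```python
def is_coordinate_sorted(records):
    """Return True if (tid, pos) pairs are sorted by chromosome, then position.

    `records` stands in for a stream of BAM alignments; any record that
    jumps backwards in chromosome or position breaks the sort order.
    """
    prev = None
    for tid, pos in records:
        if prev is not None and (tid, pos) < prev:
            return False
        prev = (tid, pos)
    return True

print(is_coordinate_sorted([(0, 100), (0, 250), (1, 5)]))  # True
print(is_coordinate_sorted([(0, 250), (0, 100)]))          # False: positions out of order
print(is_coordinate_sorted([(1, 5), (0, 9)]))              # False: chromosome appears non-continuously
```

Had brunel run anything like this over its inputs, the query-name-sorted files would have been rejected on the very first run of the pipeline a year ago.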

The co-ordinate sorting step does at least immediately precede invocation of brunel, which is relatively fast. So after correcting the Makefile, nuking the incorrectly sorted BAMs and restarting the pipeline, it wasn’t a long wait before I was ready to hassle somebody with mercury access to push the button.

Untranslated Translations

Before blindly resubmitting everything, I figured I’d save some time and face by adding samtools index to the rest of my checking procedures, to be sure that this first indexing step would at least work on vr-pipe.

The indexing still failed. The final lanelet bridged BAMs were still unsorted.

Despite feeding our now correctly co-ordinate sorted inputs to brunel, we still get an incorrectly sorted output — clearly a faux pas with brunel‘s handling of the inputs. Taking just one of the 837 failed lanelets (all but the already mapped 33 lanelets failed to index) under my wing, I ran brunel manually to try and diagnose the problem.

The error was a little different from the last run: whereas before samtools index complained about co-ordinates appearing out of order, this time chromosomes appeared non-continuously. Unfortunately, several samtools errors still do not give specific information and this is one of them. I knew that somewhere a read on some chromosome appeared somewhere it shouldn’t, but with each of these bridged BAMs containing tens of millions of records, finding it manually could be almost impossible.

I manually inspected the first 20 or so records in the bridged BAM, all the reads were on TID 1, as expected. The co-ordinates were indeed sorted, starting with the first read at position 10,000.

10,000? Seems a little high? On a hunch I grepped the bridged BAM to try and find a record on TID 1 starting at position 0:

grep -Pn -m1 "^[A-z0-9:#]*\t[0-9]*\t1\t1\t" <(samtools view 7500_7#8.GRCh37-hs37d5_bb.bam)

I got a hit. Read HS29_07500:7:1206:5383:182635#8 appears on TID 1, position 1. Its line number in the bridged BAM? 64,070,629. There’s our non-continuous chromosome.

I took the read name and checked the three sorted inputs to brunel. It appears in the “unchanged” BAM. You may recall these “unchanged” reads are those that binnie deemed as not requiring re-mapping to the new reference and can effectively “stay put”. The interesting part of the hit? In the unchanged BAM, it doesn’t appear on chromosome 1 at all but on “HSCHRUN_RANDOM_CTG19”, presumably some form of decoy sequence.

This appears to be a serious bug in brunel. This “HSCHRUN_RANDOM_CTG19” sequence (and presumably others) seem to be leaving brunel as aligned to chromosome 1 in the new reference. Adding some weight to the theory, the unchanged BAM is the only input that goes through translation too.

Let’s revisit build_translation_file, you recall that the i-th SQ line of the input BAM — the “unchanged” BAM — is mapped to the j-th entry of the user-provided translation table text file. The translation itself is recorded with trans[i] = j where both i and j rely on two loops meeting particular exit conditions.

But note the while loop:

int counter = file_entries;
[...]
while (!feof(trans_file) && !ferror(trans_file) && counter > 0) {
    getline(&linepointer, &read, trans_file);
    [...]
    trans[i] = j;
    counter--;
}
[...]

This while loop, which houses the i and j loops as well as the assignment trans[i] = j, may not run for as many SQ lines (file_entries, counted down in counter) as are found in the unchanged BAM, if the translation text file (trans_file) has fewer lines than file_entries. Which, in our case, it does: there are only entries in the translation text file for chromosomes that actually need translating (Chr1 -> 1).

Thus not every element in trans is set with a value by the while loop. Though we’ve already seen where there is no translation to be made, this doesn’t work either.

This is particularly troubling, as the check for whether a translation should be performed relies on the default value of trans being NULL. As no default value is set and 0 is the most likely value to turn up in the memory allocated by malloc, the default value for trans[i] for all i, is 0. Or in brunel terms, if I can’t find a better translation, translate SQ line i, to the 0th line (first chromosome) in the user translation file.

Holy crap. That’s a nasty bug.
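The bug is easy to reproduce in miniature. In this hypothetical Python stand-in for the C, a list initialised to 0 plays the role of the malloc’d trans array whose untouched slots happen to contain 0:

```python
def build_translation_buggy(n_sq_lines, table):
    """Mimic brunel's bug: untouched slots default to 0, i.e. 'first chromosome'."""
    trans = [0] * n_sq_lines      # stands in for uninitialised malloc'd memory
    for i, j in table.items():    # only chromosomes listed in the translation file
        trans[i] = j
    return trans

def build_translation_fixed(n_sq_lines, table):
    """The quick fix: -1 means 'no translation; treat the read as unmapped'."""
    trans = [-1] * n_sq_lines
    for i, j in table.items():
        trans[i] = j
    return trans

# Five @SQ lines in the unchanged BAM, but the translation file only covers
# the first three (e.g. "Chr1 -> 1" style renames); lines 3 and 4 are decoys:
print(build_translation_buggy(5, {0: 0, 1: 1, 2: 2}))  # [0, 1, 2, 0, 0]
print(build_translation_fixed(5, {0: 0, 1: 1, 2: 2}))  # [0, 1, 2, -1, -1]
```

In the buggy version, every read on the untranslated decoy sequences is silently reassigned to the first chromosome, which is exactly how “HSCHRUN_RANDOM_CTG19” reads ended up on chromosome 1.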

As described in the bug report, I toyed with quick fixes based on initialising trans with default values:

  • Initialise trans[i] = i
    This will likely incorrectly label reads as aligning to the i-th SQ line in the new file. If the number of SQ lines in the input is greater than that of the output, this will also likely cause samtools index to error or segfault as it attempts to index a TID that is outside the range of SQ.
  • Initialise trans[i] = NULL
    Prevents translation of TID but actually has the same effect as above
  • Initialise trans[i] = -1
    The only quick-fix grade solution that works, causes any read on a TID that has no translation to be regarded as “unmapped”. Its TID will be set to “*” and the read is placed at the end of the result file. The output file is however, valid and indexable.

In the end, the question of where these reads should actually end up is a little confusing. Josh seems to think that the new bridged BAM should contain all the old reads on their old decoy sequences if a better place in the new reference could not be found for them. In my opinion, these reads effectively “don’t map” to hs37d5 as they still lie on old decoy sequences not found in the new reference, which is convenient, as my trans[i] = -1 initialisation marks all such reads as unmapped whilst also remaining the simplest fix to the problem.

Either way, the argument as to what should be done with these reads is particularly moot for me and the QC study, because we’re only going to be calling SNPs and focussing on our Goldilocks Region on Chromosome 3, which is neither a decoy region nor Chromosome 1.

Having deployed my own fix, I re-ran our pipeline for what I hoped to be the final time, re-ran the various checks that I’ve picked up along the way (including samtools index) and was finally left with 870 lanelets ready for unimprovement.

Vanishing Read Groups

Or so I thought.

Having convinced Micheal that I wouldn’t demand he assume the role of mercury for me at short notice again, he informed me that whilst all my lanelets had successfully passed the first step of the first pipeline in the entire vr-pipe workflow (indexing), all but (you guessed it) 33 lanelets failed the following step. GATK was upset that some reads were missing an RG tag.

Sigh. Sigh. Sigh. Table flip.

GATK at least gave me a read name to look up with grep and indeed, these reads were missing their RG tag. I traced the reads backward through each intermediate file to see where these tags were lost. I found that these reads had been binned by binnie as requiring re-mapping to the new reference with bwa. The resulting file from bwa was missing an @RG line and thus each read had no RG tag.

Crumbs. I hit the web to see whether this was anticipated behaviour. The short answer was yes, but the longer answer was “you should have used -r to provide bwa with an RG line to use”. Though I throw my hands up in the air a little here to say “I didn’t write the Makefile and I’ve never used bwa“.

Luckily, Martin has recently drafted a pull request to samtools for a new subcommand: addreplacerg. Its purpose? Adding and replacing RG lines and tags in BAM files. More usefully, at least to us, it offers an operation “mode”3 to tag “orphan” records (reads that have no RG tag, which is exactly the problem I am facing) with the first RG line found in the header.
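If I’ve understood the orphan mode correctly, its core logic amounts to something like the following hypothetical Python sketch, operating on dicts standing in for BAM records rather than the actual samtools C code:

```python
def tag_orphans(header_rg_ids, reads):
    """Give any read lacking an RG tag the ID of the first @RG line in the header.

    `header_rg_ids` is the ordered list of @RG IDs from the BAM header;
    `reads` is a list of dicts standing in for alignment records.
    """
    first_rg = header_rg_ids[0]
    for read in reads:
        if "RG" not in read:       # orphan record: no read-group tag
            read["RG"] = first_rg  # adopt the first @RG line
    return reads

reads = [{"name": "r1", "RG": "lane7"}, {"name": "r2"}]
print(tag_orphans(["lane7"], reads))
# r2 picks up RG "lane7"; the already-tagged r1 is left untouched
```

Since each of my bridged BAMs carries exactly one @RG line, “first RG line in the header” is always the correct one.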

Perfect. I’ll just feed each final bridged BAM I have to samtools addreplacerg and maybe, just maybe, we’ll be able to pull the trigger on vr-pipe for the final time.

I hope.

Queue Queue Update: 1 day later

For those still following at home, my run of samtools addreplacerg seemed to go without a hitch, which in retrospect should have been suspicious. I manually inspected just a handful of the files to ensure both that the “orphaned” reads now had an RG tag (and, more specifically, that it was the correct and only RG line in the file) and that the already-tagged reads had not been interfered with. All seemed well.

After hitting the button on vr-pipe once more, it took a few hours for the web interface to catch up and throw up glaring red progress bars. It seemed the first step, the BAM indexing, was now failing? I had somehow managed to go a step backwards?

The bridged BAMs were truncated… I immediately began scouring the source code of samtools addreplacerg before realising that in my excitement I had skipped my usual quality control test suite. I consulted bhist -e, a command to display recently submitted cluster jobs that had exited with error, and was bombarded with line after line of addreplacerg job metadata. Inquiring specifically, each and every job in the array had violated its run time limit.

I anticipated addreplacerg would not require much processing, it just iterates over the input BAM, slapping an RG sticker on any record missing one and throws it on the output pile. Of course, with tens of millions of records per file even the quickest of operations can aggregate into considerable time. Thus placing the addreplacerg jobs on Sanger’s short queue was clearly an oversight of mine.

I resubmitted to the normal queue which permits a longer default run time limit and applied quality control to all files. We had the green light to re-launch vr-pipe.

Conquering Quotas Update: 4 days later

Jobs slowly crawled through the indexing and integrity checking stages and eventually began their way through the more time-consuming and intensive GATK indel discovery and re-alignment steps, before suddenly and uncontrollably failing in their hundreds. I watched a helpless vr-pipe attempt to resuscitate each job three times before calling a time of death on my project and shutting down the pipeline entirely.

Other than debugging output from vr-pipe scheduling and submitting the jobs, the error logs were empty. vr-pipe had no idea what was happening, and neither did I.

I escalated the situation to Martin, who could dig a little deeper with his mercury mask on. The problem appeared somewhat more widespread than just my pipeline; in fact all pipelines writing to the same storage cluster had come down with a case of sudden unexpected rapid job death. It was serious.

The situation: pipelines are orchestrated by vr-pipe, which is responsible for submitting jobs to the LSF scheduler for execution. LSF requires a user to be named as the owner of a job and so vr-pipe uses mercury. I am unsure whether this is just practical, or whether it is to ensure jobs get a fair share of resources by all having the same owner though I suspect it could just be an inflexibility in vr-pipe. The net result of jobs being run as mercury is that every output file is also owned by mercury. The relevance of this is that every storage cluster has user quotas, an upper bound on the amount of disk space files owned by that user may occupy before being denied write access.

Presumably you can see where this is going. In short, my job pushed mercury 28TB over quota and so disk write operations failed. Jobs, unable to write to disk, aborted immediately, but the nature of the error was not propagated to vr-pipe.

Martin is kindly taking care of the necessary juggling to rebalance the books. Will keep you posted.


tl;dr


  1. Unfortunately somebody towing a caravan decided to have their car burst into flames on the southbound M11; this and several other incidents turned a boring five-hour journey into a tiresome nine-hour ordeal. 
  2. Somewhat like a glorified sudo for humgen projects. 
  3. Turns out, its default mode is overwrite_all
  4. Honestly, it doesn’t matter how trivial the file is, if it’s going to be messed around with frequently, or is a pinnacle piece of code for orchestrating a pipeline, put it under version control. Nobody is going to judge you for trivial use of version control. 
]]>
https://samnicholls.net/2015/07/31/bridgebuilding-p3/feed/ 1 165
The Tolls of Bridge Building: Part II, Construction https://samnicholls.net/2015/07/23/bridgebuilding-p2/ https://samnicholls.net/2015/07/23/bridgebuilding-p2/#comments Thu, 23 Jul 2015 11:00:47 +0000 http://blog.ironowl.io/?p=174 Last time on Samposium, I gave a more detailed look at the project I’m working on and an overview of what has been done so far. We have 870 lanelets to pre-process and improve into samples. In this post, I explain how the project has turned into a dangerous construction site.

While trying to anticipate expected work to be done in a previous post, I wrote:

I also recall having some major trouble with needing to re-align the failed samples to a different reference: these samples having failed, were not subjected to all the processing of their QC approved counterparts, which we’ll need to apply ourselves manually, presumably painstakingly.

‘Painstakingly’ could probably be described as an understatement now. For the past few weeks I’ve spent most of my time overseeing the construction of bridges.

Bridgebuilding

References

Pretty much one of the first stops for genomic data spewed out of the instruments here is an alignment to a known human reference. It’s this step that allows most of our downstream analysis to happen: we can “pileup” sequenced reads where they belong along the human genome, compare them to each other or to other samples, and attempt to make inferences about where variation lies and what its effects may be.

Without a reference — like the metagenomic work I do back in Aberystwyth — one has to turn to more complicated de novo assembly methods and look for how the reads themselves overlap and line up with respect to each other, rather than to a reference.

Releases of the “canonical” human reference genome are managed by the Genome Reference Consortium, of which the Wellcome Trust Sanger Institute is a member. The GRC released GRCh37 in March 2009, seemingly also known as hg19. It is to this reference that our data was aligned. However, to combat some perceived issues with GRCh37, the 1000 Genomes Project team constructed their own derivative reference: hs37d5.

To improve accuracy and reduce mis-alignment, hs37d5 includes additional subsequences termed “decoy sequences”. These decoys prevent reads whose true origin is not adequately represented in the reference from mapping to an arbitrarily similar looking location somewhere else in the genome. Sequences of potential bacterial or viral contaminants, such as Epstein-Barr virus which is used in immortalisation of some types of cell line are also included to prevent contaminated reads incorrectly aligning to the human genome.

The benefits are obvious: fewer false positives and more mapped sequence, all for the cost of a re-alignment. Understandably, our study decided to remap all good lanelets from GRCh37 to hs37d5 and re-run the sample improvement process.

Recipe

Of course, to execute our quality control pipeline fairly, we too must remap all of our lanelets to hs37d5. Luckily, Josh and his team had already conjured some software to solve this problem in the form of bridgebuilder, whose components and workflow are modelled in the diagram below:

bridge_builder_v1

bridgebuilder has three main components summarised thus:

  • baker
    Responsible for the generation of the “reference bridge”, mapping the old reference to the new reference. In our case, the actual GRCh37 reference itself is bridged to hs37d5. The result is metaphorically a collection of bridges representing parts of GRCh37 and their new destination somewhere in hs37d5.
  • binnie
    Takes the original lanelet as aligned to the old reference and using the baker‘s reference bridge works out whether reads can stay where they are, or whether they fall in genomic regions covered by a bridge. For each of the 870 lanelets, the result is the population of three bins:
    Unbridged reads
    Reads whose location in hs37d5 is the same as in GRCh37 and do not require re-mapping.
    Bridged reads
    Reads that do fall on a bridge and are mapped across from GRCh37 to hs37d5.
    Unmapped reads that are now mapped
    Reads that did not have an alignment to GRCh37 but now align to hs37d5.
  • brunel
    Takes the binnie bins as “blueprints” and merges all reads to generate the new lanelet.
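As a toy model, the three-way binning can be sketched in a few lines of Python; the interval representation and all names here are illustrative assumptions, not binnie’s actual implementation:

```python
# Toy sketch of binnie's three-way binning; the interval logic and names
# are illustrative assumptions, not the actual bridgebuilder code.
def bin_read(pos, bridges):
    """Classify one read by its alignment position on the old reference.

    pos     -- position on GRCh37, or None if the read was unmapped
    bridges -- (start, end) intervals of GRCh37 covered by a reference bridge
    """
    if pos is None:
        return "unmapped"    # now a candidate for fresh alignment to hs37d5
    if any(start <= pos < end for start, end in bridges):
        return "bridged"     # falls on a bridge: map across to hs37d5
    return "unbridged"       # location identical in hs37d5: no re-mapping

bridges = [(100, 200), (500, 800)]
print([bin_read(p, bridges) for p in (50, 150, 600, None)])
```

Each of the 870 lanelets would have its reads pushed through logic of this shape to populate the three bins brunel later merges.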

Whilst designed to function together and co-exist in the same package, bridgebuilder is not currently a push-button-and-acquire-bridges process. Luckily for me, back in 2014, while I was frantically finishing the write-up of my undergraduate thesis, Josh had quickly prepared a prototype Makefile with the purpose of orchestrating our construction project.

This was neat, as it accepted a file of filenames (a “fofn”) and automatically took care of each step and all of the many intermediate files in between. More importantly, it provided a wrapper for handling submission of jobs to the compute cluster (the “farm”). The latter was especially great news for me, having spent most of my first year battling our job management system back in Aberystwyth.

In theory remapping all 870 lanelets should have been accomplished through one simple submission of Make to the farm. As you may have guessed by my tone, this was not the case.

Ruins

Shortly before calling it a day back in 2014, Josh gave the orchestrating Makefile a few shots. The results were hit and miss: a majority of the required files failed to generate, and the “superlog” containing over a million lines of aggregated stdout from the submitted jobs was a monolithic mish-mash of unnavigable segmentation faults, samtools errors, debugging and scheduling information. Repeated application of the Makefile superjob yielded a handful more desired files and the errors appeared to tail off, yet we were still missing several hundred of the final lanelets. Diagnosing where the errors were arising was particularly problematic due to two assumptions in the (albeit rushed) design of our Makefile:

  • Error Handling
    The Makefile did not expect to have to handle the potential for error between one step and the next. Many, but by no means all, of the errors pumped to the superlog pertained to the absence of intermediate files, rather than a problem with execution of the commands themselves.
  • Error Tracing
    There was no convenient mechanism to trace back entries from the superlog to the actual cluster job (and thus particular lanelet) to try and reproduce and investigate errors. The superlog was just a means of establishing a heartbeat on the superjob and roughly gauging progress.

This is where the project left off and to where we must now return.

Revisiting Wreckage

As expected, given nothing had been messed around with in the project’s year-long hiatus, rebooting the bridgebuilding pipeline via the Makefile produced similar results as before: some extra files and a heap of errors of unknown job origin. However, given there was a non-trivial set of files that the pipeline marked as complete, there were fewer files to process overall and far fewer jobs submitted as a result, so inspecting the superlog was somewhat easier.

I figured a reasonable starting point would be to see whether errors could be narrowed down by looking for cluster jobs that exited with a failed status code. Unfortunately the superlog doesn’t contain the LSF summary output for any of the spawned jobs and so I had to try and recover this myself. I pulled out job numbers with grep and cut and fed them to an LSF command that returns status information for completed jobs. I noticed a pattern, and with a little grep-fu I was able to pull out jobs that failed specifically due to resource constraints, a familiar problem. Though in this case the resource constraints were imposed on each job by rules in the Makefile, so the cure was merely to be less frugal.
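The grep-and-cut recovery amounted to something like the following; the log lines below are a hypothetical approximation of LSF’s submission chatter, not a real excerpt:

```python
import re

# Hypothetical superlog fragment; the real LSF output is far more verbose
# and this exact format is an assumption for illustration only.
superlog = """\
Job <123456> is submitted to queue <normal>.
Job <123457> is submitted to queue <normal>.
Job <123456> done.
"""

# The grep/cut equivalent: pull out the unique job IDs, ready to feed to an
# LSF status command (e.g. bjobs or bhist) one at a time.
job_ids = sorted(set(re.findall(r"Job <(\d+)>", superlog)))
print(job_ids)
```

From there, grepping the returned status blocks for resource-limit terminations picks out the jobs that simply needed more RAM or time.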

Generously doubling or tripling the requested RAM for each job step and relaunching the bridgebuilder pipeline created an explosion of work on the cluster, and within hours we had apparently fallen into a scenario where we had a majority of the files desired.

But wielding a strong distrust for anything that a computer does, I was not satisfied with the progress.

Reaping with quickcheck

Pretty much all of the intermediary files, as well as the final output files, are “BAMs”. BAM (binary SAM) files are compressed binary representations of SAM files that contain data pertaining to DNA sequences and alignments thereof. samtools is a suite of bioinformatics tools primarily designed to work with such files. After samtools index proved unreliable for catching more subtle errors, Josh wrote a new subcommand for samtools called: quickcheck.

samtools quickcheck is capable of quickly checking (shocker) whether a BAM, SAM or CRAM file appears “intact”; i.e. the header is valid, the file contains at least one target sequence and, for BAMs, that the compressed end-of-file (EOF) block is present, the absence of which is a pretty good sign your BAM is truncated. I fed each of the now thousands of BAM files to quickcheck and hundreds of filenames began to flow up the terminal, revealing a new problem with the bridgebuilder pipeline’s Makefile:

  • File Creation is Success
    The existence of a file, be it an intermediate or final output file, is viewed as a success, regardless of the exit status of the job that created that file.

The Makefile was set up to remove files that are left broken or unfinished if a job fails, but this does not happen if the job is forcibly terminated by LSF for exceeding its time or memory allocation, which was unfortunately the case for hundreds of jobs. This rule also does not cover cases where a failure occurs but the command does not return a failing exit code, a problem that still plagues older parts of samtools. Thus, quickchecking the results is the only assuring step that can be performed.

Worse still, once a file existed, the next step was executed under the assumption the input intermediate file must be valid as it had not been cleaned away. Yet this step would likely fail itself due to the invalid input, it too leaving behind an invalid output file to go on to the next step, and so on.

As an inexpensive operation, it was simple to run quickcheck on every BAM file in a reasonable amount of time. Any file that failed a quickcheck was removed from the directory, forcing it to be regenerated anew on the next iteration of the carousel pipeline.
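That EOF test is simple enough to sketch stand-alone: a healthy BAM ends with the 28-byte BGZF end-of-file marker defined in the SAM/BAM specification. The sketch below (not quickcheck itself) catches truncation only, nothing deeper:

```python
import os
import tempfile

# The 28-byte BGZF EOF marker from the SAM/BAM specification: an empty
# compressed block that terminates every intact BAM.
BGZF_EOF = bytes.fromhex(
    "1f8b08040000000000ff0600424302001b0003000000000000000000"
)

def has_bgzf_eof(path):
    """Return True if the file ends with the BGZF EOF block."""
    if os.path.getsize(path) < len(BGZF_EOF):
        return False
    with open(path, "rb") as f:
        f.seek(-len(BGZF_EOF), os.SEEK_END)
        return f.read() == BGZF_EOF

# Demo on a fake file standing in for a BAM.
fd, demo = tempfile.mkstemp()
os.close(fd)
with open(demo, "wb") as f:
    f.write(b"pretend bgzf blocks" + BGZF_EOF)
print(has_bgzf_eof(demo))
os.remove(demo)
```

A file truncated mid-write loses this trailer and fails the check, which is exactly how quickcheck flushed out the broken intermediates.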

Rinse and Repeat

After going full-nuclear on files that upset quickcheck, the next iteration of the pipeline took much longer, but thankfully the superlog was quieter with respect to errors. It seemed that having increased limits on the requests for time and memory, most of the jobs were happy to go about their business successfully.

But why repeat what should be a deterministic process?

  • Jobs and nodes fail stochastically
    We’re running a significant number of simultaneous jobs, all reading and writing large files to the same directory. It’s just a statistical question of when rather than if at least one bit goes awry, and indeed it clearly does. Repeating the process and removing any files that fail quickcheck allows us to ensure we’re only running the jobs where something has gone wrong.
  • The Makefile stalls
    Submitting a Makefile to a node with the intention for it to then submit jobs of its own to other nodes and keep a line of communication with each of them open, is potentially not ideal. Occasionally server blips would cause these comm lines to fail and the Makefile would stall waiting to hear back from its children. In this scenario the superjob would have to be killed which can leave quite a mess behind.

Following a few rounds of repeated application of the pipeline and rinsing with quickcheck, we once again appeared in the reasonable position of having the finish line in sight. We were just around 200 lanelets off our 870 total. A handful of errors implied that some jobs wanted even more resources, and thanks to the less monolithic nature of the superlog, a repetitive error, the final brunel step throwing a segmentation fault, came to my attention.

But not being easily satisfied, I set out to find something wrong with our latest fresh batch of bridges.

Clobbering with sort

After killing a set of intermediary jobs that appeared to have stalled — all sorting operations using samtools sort — I inspected the damage; unsurprisingly there were several invalid output files, but upon inspecting the assigned temporary directory, I found only a handful of files named .tmp.NNNN.bam (where NNNN is some sequential index). That’s odd… There should have been a whole host of temporary files…

I checked the Makefile, we were using samtools sort‘s -T argument to provide a path to the temporary directory. Except -T is not supposed to be a directory path:

-T PREFIX Write temporary files to PREFIX.nnnn.bam

So, providing a directory path to -T means temporary files are written to /path/to/tmp/.NNNN.bam.

Oh shit. Each job shared the same -T and must therefore have shared a PREFIX; this is very bad. Immediately I knocked up a quick test case on my laptop to see whether a hunch would prove correct. It did: I had found a “bug” in samtools sort, which clobbers temporary files when -T is “misused” in this way.

Although this was caused by our mistaken use of -T as a path to a temporary directory not a “prefix”, the resulting behaviour is dangerous, rather unexpected and as explained in my bug report below, woefully beautiful:

Reproduce

Execute two sorts providing a duplicate prefix or duplicate directory path to -T:

Result

output1.sam contains input2.sam’s header along with read groups appearing in input2.sam and potentially some from input1.sam. output2.sam is not created as the job aborts.

Behaviour

  • Job A begins executing, creating temporary files in tmp/ with the name .0000.tmp to .NNNN.tmp.
  • Job B begins, chasing Job A, clobbering temporary files in tmp/ from .0000.tmp to .MMMM.tmp.
  • Job A completes sorting temporary files and merges .0000.tmp to .NNNN.tmp, thus incorrectly including read groups from Job B in place of the desired read groups from Job A (where N <= M).
  • Job A completes merging and deletes temporary files.
  • Job B crashes attempting to read .0000.tmp as it no longer exists.
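The clobbering mechanism is easy to simulate without samtools at all; this toy Python sketch (invented filenames, not samtools code) reproduces the behaviour above:

```python
import os
import tempfile

# Two simulated sort "jobs" spill numbered chunks under the same shared
# prefix, exactly as happened when every job was given the same -T.
tmpdir = tempfile.mkdtemp()
prefix = os.path.join(tmpdir, "")   # shared prefix, like -T tmp/

def spill_chunks(job, n):
    for i in range(n):
        with open(f"{prefix}.{i:04d}.tmp", "w") as f:
            f.write(f"reads from job {job}\n")

spill_chunks("A", 3)   # Job A writes .0000 to .0002
spill_chunks("B", 2)   # Job B chases it, clobbering .0000 and .0001

# Job A now merges what it believes are its own three chunks:
merged = [open(f"{prefix}.{i:04d}.tmp").read() for i in range(3)]
print(merged)
```

Two of Job A’s chunks have been silently replaced by Job B’s; Job A would then delete the shared temporary files on completion, leaving Job B to crash on the missing .0000.tmp.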

By default, samtools sort starts writing temporary files to disk if it exceeds 768MB of RAM in a sorting thread. Initially, none of the sorting jobs performed by the Makefile were allotted more than 512MB of RAM in total, and thus any job on track to exceed 768MB of RAM should have been killed by LSF before a temporary file ever touched disk. When I showed up and liberally applied RAM to the affected area, sorting operations with a need for temporary files became possible and the clobbering occurred, detected only by luck.

What’s elegantly disconcerting about this bug is that the resulting file is valid and thus cannot be identified by quickcheck, my only weapon against bad BAMs. A solution would be to extract the header and every read group that occurs in the file and look for at least one contradiction. However, due to the size of these BAMs (upwards of 5GB, containing tens of millions of alignment records), it turned out to be faster to regenerate them all rather than trying to diagnose which specifically might have fallen prey to the bug for targeted regeneration.

I altered the Makefile to use the name of the input file after the path when creating the PREFIX to use for temporary files in future, side-stepping the clobbering (and using -T as intended).
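The change amounts to something like this (a sketch of the idea, not the Makefile’s exact rule):

```python
import os

# Derive the temporary-file prefix from the input filename so that two
# concurrent sorts can never share a prefix (names are illustrative).
def temp_prefix(tmpdir, input_path):
    return os.path.join(tmpdir, os.path.basename(input_path))

print(temp_prefix("tmp", "lanelets/7293_3#8.bam"))
print(temp_prefix("tmp", "lanelets/8328_1#15.bam"))
```

Since every lanelet has a unique filename, every sort now gets a unique PREFIX and temporary files can no longer collide.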

brunel and The 33

Following the regeneration of any sorted file (and all files that use those sorted files thereafter), it was time to tackle the mysterious stack of segmentation faults occurring at the final hurdle of the process, brunel. Our now rather uncluttered superlog made finding cases to reproduce simple, especially given the obvious thirty-eight-line backtrace and memory map spewed out after brunel encountered an abort:

*** glibc detected *** /software/hgi/pkglocal/bridgebuilder-0.3.6/bin/brunel: munmap_chunk(): invalid pointer: 0x00000000022ec900 ***
[...]
/tmp/1436571537.7566550: line 8: 21818 Aborted /software/hgi/pkglocal/bridgebuilder-0.3.6/bin/brunel 8328_1#15.GRCh37-hs37d5_bb_bam_header.txt 8328_1#15_GRCh37-hs37d5_unchanged.bam:GRCh37-hs37d5.trans.txt 8328_1#15_GRCh37-hs37d5_remap.queryname_sort.map_hs37d5.fixmate.coordinate_sort.bam 8328_1#15_GRCh37-hs37d5_bridged.queryname_sort.fixmate.coordinate_sort.bam 8328_1#15.GRCh37-hs37d5_bb.bam

Back with our old friend, gdb, I found the error seemed to occur when brunel attempted to free a pointer to what should have been a line of text from one of the input files:

char* linepointer = NULL;
size_t read = 0;  /* buffer size for getline() */
[...]
while (!feof(trans_file) && !ferror(trans_file) && counter > 0) {
    getline(&linepointer, &read, trans_file);
    [...]
}
free(linepointer);

Yet if linepointer was never assigned to, perhaps because trans_file was empty, brunel would attempt to free NULL; case closed. However, free(NULL) is defined to be a harmless no-op, and in any case the theory was disproved with some good old-fashioned print statement debugging — trans_file was indeed not empty and linepointer was in fact being used.

After much head-scratching, more print-debugging and commenting out various lines, I discovered all ran well if I removed the line that updates the i-th element of the translation array, trans. How strange. I debugged some more and uncovered a bug in brunel:

int *trans = malloc(sizeof(int)*file_entries);

char* linepointer = NULL;
size_t read = 0;  /* buffer size for getline() */
[...]
while (!feof(trans_file) && !ferror(trans_file) && counter > 0) {
    getline(&linepointer, &read, trans_file);
    [...]

    // lookup tid of original and replacement
    int i = 0;
    for ( ; i < file_entries; i++ ) {
      char* item = file_header->target_name[i];
      if (!strcmp(item,linepointer)) { break; }
    }
    int j = 0;
    for ( ; j < replace_entries; j++ ) {
        char* item = replace_header->target_name[j];
        if (!strcmp(item,sep)) { break; }
    }
    trans[i] = j;
    counter--;
}
free(linepointer);

Look closely. Assignments to the trans array rely on setting both i and j by breaking those two loops when a condition is met. However if the condition is never met, the loop counter will be equal to the loop end condition. This is particularly important for i, as it is used to index trans.

At a length of file_entries, assigning an element at the maximum bound of loop counter i (file_entries) causes brunel to write to memory beyond that which was allocated for storing trans, likely into the variable that was allocated immediately afterwards… linepointer.

So it seems that linepointer is not NULL but is in fact mangled by the out-of-bounds write. I added a quick patch that checked whether i < file_entries before trying to update the trans array and all was well.
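The patched lookup can be modelled in a few lines of Python (an illustrative re-implementation, not brunel’s C):

```python
# Python model of the patched translation lookup. In the C code a failed
# search falls through with i == file_entries; here the guard the patch
# added is made explicit before writing to trans.
def build_translation(old_names, new_names, mapping):
    trans = [None] * len(old_names)
    for old, new in mapping:
        i = old_names.index(old) if old in old_names else len(old_names)
        j = new_names.index(new) if new in new_names else len(new_names)
        if i < len(old_names) and j < len(new_names):   # the patch's guard
            trans[i] = j
    return trans

old = ["Chr1", "Chr2"]
new = ["1", "2", "MT"]
print(build_translation(old, new, [("Chr1", "1"), ("ChrX", "X")]))
```

A mapping line whose chromosome is absent from the header now simply leaves trans untouched instead of scribbling past the end of the array.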

Yet curiosity got the better of me: only a small subset of files (33/870) were throwing an abort at this brunel step, so why not all of them? What’s more, after some grepping over our old pipeline logs, it was the same 33 lanelets each time too. Weird.

Clearly they all shared some property. The lanelet inputs were definitely valid, and a brief manual inspection of the headers and a scroll through the body showed nothing glaringly out of place. Focussing back on the function of brunel, the nature of this error (the i loop never hitting its break because the exit condition is not met) gives us a good place to start.

These loops in brunel are designed to “translate” the sequence identifiers from the old reference to those of the new reference. For example: the label for the first chromosome in GRCh37 is “Chr1”, but in hs37d5 it’s known simply as “1”. It’s obviously important these match up, and ensuring they do is the purpose of the build_translation_file function. This mapping is defined by a text file that is also passed to brunel at invocation. If there is a valid mapping to be made, the loops meet their conditions and when trans[i] = j: i refers to the SQ line in the header to be translated and j refers to the new SQ line in the new file that will contain all the reads. Lovely.

What about when there are no valid mappings to be made? Under what conditions could this even happen? The i-loop would never meet its break condition if a chromosome to be translated (as listed in the translation text file) was never seen in the first place, i.e. there were no sequence identifiers using the GRCh37 syntax. A sprinkling of debugging print statements confirmed no translations ever occurred, and the out-of-bounds assignment with i = file_entries was always the result for each iteration through the translation file…

I opened the headers for one of “the 33” lanelets again and confusingly, it was already mapped to hs37d5.

Turns out, these 33 lanelets were from a later phase of the study from which they originated and thus had already been mapped to hs37d5 as part of that phase’s pipeline. We’d been trying to forcibly remap them to the same reference they were already mapped to.

The solution was simple: remove the lanelet identifiers for “the 33” from our pipeline and pull the raw lanelets (as they’ve already been mapped to hs37d5) from the database later. We now had just 837 lanelets to care for.

The 177

That should have been the end of things, the 837 lanelets we needed to worry about had made their way through the entire pipeline from start to finish and we had 837 bridged BAMs. But, with my paranoia getting out of hand (and given the events so far, perhaps it was now justified), I collected the file sizes for each class of intermediate file and calculated summary statistics.

Unsurprisingly, I found an anomaly. For one class of file, the BAM files that were supposed to contain all reads that binnie determined as requiring re-alignment because they fell on a bridge: a high standard deviation and a rather positively skewed looking set of quartiles. Looking at the files in question, 387 had a size of 177 bytes. Opening just one with samtools view -h drew a deep but unexpected sigh:

@SQ SN:MT LN:16569
@SQ SN:NC_007605 LN:171823
@SQ SN:hs37d5 LN:35477943
@PG ID:bwa PN:bwa VN:0.5.9-r16

Empty other than a modest header. Damn. Damn damn damn. I went for a walk.

The LSF records didn’t reach far back enough to let me find out how this might have happened, September 2014 was millions of jobs ago. The important thing is that this problem didn’t seem to be happening to newer jobs that had generated this particular class of file more recently.

Why was this still causing trouble now? Remember: File Creation is Success!

The solution was to nuke these “has_177” files and let the Makefile take care of the rest.

The Class of 2014

Once that was complete, we had a healthier looking set of 837 lanelets and your average “bridged” BAM looked bigger too, which is good news. Ultra paranoid, despite the results now passing quickcheck, I decided to regenerate a small subset of the 450 non-177 lanelets from scratch and compare the size and md5 of the resulting files to the originals.

They were different.

The culprit this time? Indexes built by bwa that were generated in 2014 were much smaller than their 2015 counterparts. In a bid to not keep repeating the past, I nuked any intermediate file that was not from that week and let the Makefile regenerate the gaps and final outputs.

All now seemed well.

Final Checks

Not content with finally reaching this point, I tried to rigorously prove the data was ready for the sample improvement step via vr-pipe:

  • quickcheck
    Ensure the EOF block was present and intact.
  • bsub -e
    Checked that none of the submissions had actually reported an error to LSF during execution.
  • bsub -d UNKN
    Ensured that submissions did not spend any time during execution in an “UNKNOWN” state, an indicator of the job stalling and likely producing bad output.
  • samtools view -c
    Used the counting argument of samtools view to confirm the number of reads in the final bridged BAM matched the number of reads in the input lanelet. Using samtools view to count records in the entire file will also catch issues with truncation or corruption (albeit in a time-consuming fashion).
  • gzip -tf
    Ensure the compression of each block in the BAM was correct.

Now equipped with 837 lanelets that passed these checks, it was time to attempt BAM improvement.


tl;dr

]]>
https://samnicholls.net/2015/07/23/bridgebuilding-p2/feed/ 1 174
The Tolls of Bridge Building: Part I, Background https://samnicholls.net/2015/07/22/bridgebuilding/ https://samnicholls.net/2015/07/22/bridgebuilding/#respond Wed, 22 Jul 2015 11:00:27 +0000 http://blog.ironowl.io/?p=185 I’m at the Sanger Institute for just another two weeks before the next stop of my Summer Research Tour and it’s about time I checked in. For those of you who still thought I was in Aberystwyth working on my tan1 I suggest you catch up with my previous post.

The flagship part of my glorious return to Sanger is to try and finally conclude what was started last year in the form of my undergraduate dissertation. Let’s revisit the anticipated tasks and give an update on progress.

Recall in detail what we were doing and figure out how far we got

i.e. Dig out the thesis, draw some diagrams and run ls everywhere.

This was easier than expected, in part because my thesis was not as bad as my memory served, but also thanks to Josh’s well organised system for handling the data towards the end of the project remaining intact even a year after we’d stopped touching the directory.

Confirm the Goldilocks region

It would be worth ensuring the latest version of Goldilocks (which has long fixed some bugs I would really like to forget) still returns the same results as it did when I was doing my thesis.

As stated, Goldilocks has been significantly revamped since it was last used for this quality control project. Having unpacked and dusted down the data from my microserver2, I just needed to make a few minor edits to the original script that called Goldilocks to conform to new interfaces. Re-running the analysis took less than a minute and the results were the same: we definitely had the “just right” region for the data we had.

Feeling pretty good about this, it was then that I realised that the data we had was not the data we would use. Indeed, as I had mentioned in a footnote of my last post, our samples had come from two different studies that had been sequenced at two different depths; one study was sequenced at twice the coverage of the other, and so we had previously decided not to use the samples from the lesser-covered study.

But I never went back and re-generated the Goldilocks region to take this into account. Oh crumbs.

For the purpose of our study, Goldilocks was designed to find a region of the human genome that expressed a number of variants representative of the entire genome itself — not too many, not too few, but “just right“. These variants were determined by the samples in the studies themselves; thus a variant site could appear in either of the two studies, or both. Taking away one of these studies takes away its variants, with a likely impact on the Goldilocks region.

Re-running the analysis with just the variants in the higher-coverage study of course led to different results. Worse still, the previously chosen Goldilocks region was no longer the candidate recommended to take through to the pipeline.

Our definition of “just right” was a two-step process:

  • Identify regions containing a number of variants between ±5 percentiles of the median number of variants found over all regions, and
  • Of those candidates, sort them descending by the number of variants that appear on a separate chip study.

Looking at the median number of variants and the new level of variance in the distribution of variants, I widened the criteria of that first filtering step to ±10 percentiles, mostly in the hope that this would make the original region come back and make my terrible mistake go away, but also partly because I’d convinced myself it made statistical sense. Whatever the reason, under a census with the new criteria the previous region returned and both Josh and I could breathe a sigh of relief.
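In outline, the two-step selection can be sketched like so; the counts are invented and the percentile arithmetic is my approximation of the criteria, not the actual Goldilocks code:

```python
import statistics

# Step 1: keep regions whose variant count sits within +/-10 percentiles of
# the median (i.e. between the 40th and 60th percentiles of all regions).
# Step 2: rank survivors by their variant count in the separate chip study.
# All counts below are invented for illustration.
def goldilocks(wgs_counts, chip_counts, window=10):
    cuts = statistics.quantiles(wgs_counts, n=100)  # 99 percentile cut points
    lo, hi = cuts[50 - window - 1], cuts[50 + window - 1]
    candidates = [i for i, c in enumerate(wgs_counts) if lo <= c <= hi]
    return sorted(candidates, key=lambda i: chip_counts[i], reverse=True)

wgs  = [5, 50, 52, 48, 51, 49, 100, 47, 53, 2]   # variants per 1Mbp region
chip = [0, 3,  0,  0,  0,  9,  0,   0,  0,  0]   # chip-study variants
print(goldilocks(wgs, chip))
```

Regions with extreme counts are filtered out in the first step, and the chip study then breaks ties in favour of regions with the most sites to validate against.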

I plotted a lovely Circos graph too:

circos

You’re looking at a circular representation of chromosome 3 of the human genome. Moving from outside to inside you see: cytogenetic banding, a scatter plot of counts of variants in 1Mbp regions over the whole-genome study, a heatmap encoding the scatter plot, “Goldiblocks” where those variant counts fall within the desired ±10 percentiles of the median, followed by a second ring of “Goldiblocks” where variant counts fall in the top 5 percentiles of the separate chip study, and finally a heatmap and scatter plot of variants seen at the corresponding 1Mbp regions in the chip study.

Finally, the sparse yellow triangles along the outer circle pinpoint the “Goldilocks” candidate regions: where the inner “Goldiblocks” overlap, representing a region’s match to both criteria. Ultimately we used the candidate between 46-47Mbp on this chromosome. Note in particular the very high number of variants on the chip study at that region, allowing a very high number of sites at which to compare the results of our repeated analysis pipeline.

Confirm data integrity

The data has been sat around in the way of actual in-progress science for the best part of a year and has possibly been moved or “tidied away”.

There has been no housekeeping since the project stalled, presumably because Sanger hasn’t been desperate for the room. In fact, a quick df reveals an eye watering ~6PB available distributed over the various scratch drives which makes our ~50TB scratch cluster back in Aberystwyth look like a flash drive. Everything is where we left it and with a bit of time and tea, Josh was able to explain the file hierarchy and I was left to decipher the rest.

But before I talk about what we’ve been doing for the past few weeks, I’ll provide a more detailed introduction to what we are trying to achieve:

Background

Reminder: Project Goal

As briefly explained in my last post, the goal of our experiment is to characterise, in terms of quality, what a “bad” sample actually is. The aim would then be to use this to train a machine learning algorithm to identify future “bad” samples and prevent a detrimental effect on downstream analysis.

To do this, we plan to repeatedly push a few thousand samples, already known to have passed or failed Sanger’s current quality control procedure, through an analysis pipeline, holding out a single sample in turn. We’ll then calculate the difference from a known result and be able to reclassify the left-out sample as good or bad based on whether the accuracy of a particular run improved in its absence.
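In miniature, the reclassification rule looks something like this (invented numbers, not the real pipeline’s accuracy metric):

```python
# Toy sketch of the hold-one-out idea: rerun the analysis with each lanelet
# held out and call it "bad" if accuracy improves in its absence.
def classify(accuracy_without, baseline):
    return {
        lanelet: ("bad" if acc > baseline else "good")
        for lanelet, acc in accuracy_without.items()
    }

baseline = 0.90                 # accuracy with every lanelet included
accuracy_without = {            # accuracy of the run holding each one out
    "7293_3#8": 0.95,           # run improved without it: likely bad
    "8328_1#15": 0.89,          # run got worse without it: likely good
}
print(classify(accuracy_without, baseline))
```

The real experiment replaces these made-up accuracies with the difference from a known result at the Goldilocks region’s chip-validated sites.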

However, that explanation is somewhat oversimplified; to get more in depth, we’ll first have to introduce some jargon.

Terminology: Samples, Lanes and Lanelets

A sample is a distinct DNA specimen extracted from a particular person. For the purpose of sequencing, samples are pipetted into a flowcell – a glass slide containing a series of very thin tubules known as lanes. It is within these lanes that the chemical reactions involved in sequencing take place. Note that a lane can contain more than one sample and a sample can appear in more than one lane; this is known as sample multiplexing and helps to ensure that the failure of a particular lane does not hinder analysis of a sample (as it will still be sequenced as part of another lane). The more abstract of the definitions, a lanelet is the aggregate read of all clusters of a particular sample in a single lane.

Samples are made from lanelets, which are produced by sequencing over lanes in a sequencer. So, it is in fact lanelets that we perform quality control on, and thus want to create a machine learning classifier for. Why not just make the distinction to begin with? Merely because it’s often easier to say “sample” when giving an overview of the project!
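The many-to-many relationship between lanes and samples can be illustrated with a toy grouping. The lanelet IDs and sample names below are made up (loosely following the run_lane#tag style of IDs like 7293_3#8 mentioned earlier); the point is simply that lanelets aggregate into samples.

```python
from collections import defaultdict

def group_by_sample(lanelets):
    """Group (lanelet_id, sample_name, passed_qc) records by sample."""
    samples = defaultdict(list)
    for lanelet_id, sample, passed in lanelets:
        samples[sample].append((lanelet_id, passed))
    return dict(samples)

# Hypothetical records: sampleA appears in two lanes, and lane 7000_1
# carries two multiplexed samples.
lanelets = [
    ("7000_1#1", "sampleA", True),
    ("7000_2#1", "sampleA", False),
    ("7000_1#2", "sampleB", True),
]
```

Here `group_by_sample(lanelets)` yields two samples, with sampleA composed of two lanelets.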

Terminology: Sample Improvement

A sample is typically multiplexed across many lanes, potentially on different flowcells. Each aggregation of a particular sample in a single lane is a lanelet, and these lanelets effectively compose the sample. So when one talks about “a sample”, in terms of post-sequencing data, we’re looking at a cleverly combined concoction of all the lanelets that pertain to that sample.

Unsimplifying

The Problem of Quality

When Sanger’s current quality control system deems that a lanelet has “failed” it’s kicked to the roadside and dumped in a database for posterity, potentially in the hope of being awakened by a naive undergraduate student asking questions about quality. Lanelets that have “passed” are polished up and taken care of and never told about their unfortunate siblings. The key is that time and resources aren’t wasted pushing perceivably crappy data through post-sequencing finishing pipelines.

However for the purpose of our analysis pipeline, we want to perform a leave-one-out analysis for all lanelets (good and bad) and measure the effect caused by their absence. This in itself poses some issues:

  • Failed lanelets aren’t polished
    We must push all lanelets that have failed through the same process as their passing counterparts such that they can be treated equally during the leave-one-out analysis. i.e. We can’t just use the failed lanelets as they are because we won’t be able to discern whether they are bad because they are “bad” or just because they haven’t been pre-processed like the others.

  • Failed lanelets are dropped during improvement
    Good lanelets compose samples, bad lanelets are ignored. For any sample that contained one or more failed lanelets, we must regenerate the improved sample file.

  • Polished lanelets are temporary
    Improved samples are the input to most downstream analysis, nothing ever looks at the raw lanelet data. Once the appropriate (good) processed lanelets are magically merged together to create an improved sample, those processed lanelets are not kept around. Only raw lanelets are kept on ice and only improved samples are kept for analysis. We’ll need to regenerate them for all passed lanelets that appear in a sample with at least one failed lanelet.

The Data

The study of interest yields 9508 lanelets which compose 2932 samples, actual people. Our pipeline calls for the “comparison to a known result”. The reason we can make such a comparison is because a subset of these 2932 participants had their DNA analysed with a different technology called a “SNP chip” (or array). Rather than whole genome sequencing, very specific positions of interest in a genome can be called with higher confidence. 903 participants were genotyped with both studies, yielding 2846 lanelets.

Of these 903 samples, 229 had at least one lanelet (but not all) fail quality control and 5 had all associated lanelets fail. This is tabulated for convenience below:

Source                       Lanelets  Samples
All                          9508      2932
Overlapping                  2846      903
…of which failed partially   339       229
…of which failed entirely    15        5

But we can’t just worry about failures. For our purposes, we need the improved samples to include the failed lanelets. To rebuild those 229 samples that contain a mix of both passes and fails, we must pull all associated lanelets: all 855 of them. Including the 15 from the 5 samples that failed entirely, we have a total of 870 lanelets to process and graduate to full-fledged samples.

The next step is to commence a presumably lengthy battle with pre-processing tools.


tl;dr

  • I’m still at Sanger for two weeks, our data is seemingly intact and the Goldilocks region is correct.
  • Circos graphs are very pretty.

  1. Though I have been to Spain
  2. Scene missing Sam panicking over zfs backup pools no longer being mounted on aforementioned microserver. 
Sanger Sequel
https://samnicholls.net/2015/06/24/sanger-sequel/
Wed, 24 Jun 2015 11:00:02 +0000

In a change to scheduled programming, days after touching down from my holiday (which needs a post of its own) I moved1 to spend the next few weeks back at the Wellcome Trust Sanger Institute in Cambridgeshire. I interned here previously in 2012 and it’s still like working at a science-orientated Google thanks to the overwhelming amount of work being done and the crippling inferiority complex that comes from being surrounded by internationally renowned scientists. Though at least I’m not acquiring significant mass from free food.

My aim here is twofold and outlined below. Though of course, I’m still on the books back at Aberystwyth and it would be both naughty and cruel of me to leave my newly acquired data cold, alone and untouched until I get back.

Aims

(1) Produce a Sequel

My undergraduate dissertation was titled: Application of Machine Learning Techniques to Next Generation Sequencing Quality Control, and was carried out in collaboration with some colleagues from my previous placement at the Sanger Institute2. The project was to build a machine learning framework capable of improving detection of “bad” samples by first characterising what it means to be a bad sample.

In short, the idea was to repeatedly push a large number of samples (each known to have individually passed or failed some internal quality control mechanism) through some analysis pipeline, holding a single sample out from the analysis in turn. The difference to a known result would then be calculated and samples would be re-classified as good or bad based on whether the accuracy of a particular run increased or decreased in their absence.

Ultimately the scope was too large and the tools too fragile to complete the end-goal in the time that I had (though it still achieved 90% and won an award, so one can’t complain too much) but we still have the data and while I am here it would be interesting to try and pick up where we left off. I expect to do battle with the following tasks over the next few days:

Recall in detail what we were doing and figure out how far we got
i.e. Dig out the thesis, draw some diagrams and run ls everywhere.

Confirm the Goldilocks region
Due in part to the short time that I had to complete this project the first time around — a constraint I still have — I authored a tool named Goldilocks to “narrow down” my analysis from a whole genome to just a 1Mbp window. It would be worth ensuring the latest version of Goldilocks (which has long fixed some bugs I would really like to forget) still returns the same results as it did when I was doing my thesis.

Confirm data integrity
The data has been sat around in the way of actual in-progress science for the best part of a year and has possibly been moved or “tidied away”. It would be worth ensuring all the files are actually intact and for the sake of completeness revisit how those files came to be and regenerate them. This will encompass ensuring the Goldilocks region for each sample was correctly extracted. I recall the samples were made up of two studies and we may have decided not to pursue one of them due to differences in sequencing3. I also recall having some major trouble with needing to re-align the failed samples to a different reference: these samples, having failed, were not subjected to all the processing of their QC-approved counterparts, which we’ll need to apply ourselves manually, presumably painstakingly.

Prepare data for the pipeline
The nail in the coffin for the first stab at this project was the data preparation: samtools merge was just woefully slow in handling the scale of data that I had, in particular struggling to merge many thousands of files at once. A significant amount of project time was spent tracking and patching memory leaks and contributing other functionality (more on this in a moment) that left me with little time at the end to actually push the data through the pipeline and get results. samtools has undergone some rapid improvements since and I suspect this step will no longer pose such a hurdle.

(2) Contribute to samtools

As I briefly alluded to, during the course of my undergraduate dissertation I authored several pull requests to a popular open-source bioinformatics toolkit known as samtools, which was initially created and continues to be maintained right here at the Sanger Institute. In particular, these pull requests improved documentation and patched some memory leaks for samtools merge, and also added naive header parsing so that input file metadata could be organised into basic structures for much more efficient iterative access later; significantly improving the time performance of samtools merge.

Header parsing has been a long sought-after feature for samtools but none of the core maintainers had the time to put aside to take a good look at the RFC I had submitted. Now that I’m in-house and have put a face to a username (catching the most recent samtools steering meeting off-guard), I’ve been tasked with trying to get this done before I leave at the end of July.

No pressure.


tl;dr

  • I live in Cambridge until the end of July, please don’t try and find me in my office.4

  1. For what is at least the 10th major house move I’ve made since heading out to university. 
  2. As soon as I had my foot in the door I refused to take it away. 
  3. The sequencing was conducted at different depths between the two standalone studies and it was suspected this may introduce some bias I didn’t want to deal with. 
  4. Not that I’m ever in there, anyway. 
The Story so Far: Part I, A Toy Dataset
https://samnicholls.net/2015/04/21/the-story-so-far-p1/
Tue, 21 Apr 2015 11:00:23 +0000

In this somewhat long and long overdue post, I’ll attempt to explain the work done so far, give an overview of the many issues encountered along the way, and offer an insight into why doing science is much harder than it ought to be.

This post got a little longer than anticipated, so I’ve sharded it like everything else around here.


In the beginning…

To address my lack of experience in handling metagenomic data, I was given a small1 dataset to play with. Immediately I had to adjust my interpretation of what constitutes a “small” file. Previously the largest single input I’ve had to work with would probably have been the human reference genome (GRCh37) which as a FASTA2 file clocks in at a little over 3GB3.

Thus imagine my dismay when I am directed to the directory of my input data and find two 42GB files.
Together, the files are 28x the size of the largest file I’ve ever worked with…

So, what is it?

Size aside, what are we even looking at and how is there so much of it?

The files represent approximately 195 million read pairs from a nextgen4 sequencing run, with each file holding each half of the pair in the FASTQ format. The dataset is from a previous IBERS PhD student and was introduced in a 2014 paper titled Metaphylogenomic and potential functionality of the limpet Patella pellucida’s gastrointestinal tract microbiome [PubMed]. According to the paper over 100 Blue-rayed Limpets (Patella pellucida) were collected from the shore of Aberystwyth and placed into tanks to graze on Oarweed (Laminaria digitata) for one month. 60 were plated, anesthetized and aseptically dissected; the extracted digestive tracts were vortexed and homogenized before repeated filtering and a final centrifugation to concentrate cells as a pellet. The pellets were then resuspended and DNA was extracted with a soil kit to create an Illumina paired-end library.

The paper describes the post-sequencing data handling briefly: the net result of 398 million reads were quality processed using fastq-mcf to remove adaptor sequences, reads with quality lower than Q20 and reads shorter than 31bp. The first 15bp of each read were also truncated5. It was noted the remaining 391 million reads were heavily contaminated with host-derived sequences and thus insufficient for functional analysis.

My job was to investigate to what extent the contamination had occurred and to investigate whether any non-limpet reads could be salvaged for functional analysis.

Let’s take a closer look at the format to see what we’re dealing with.

FASTQ Format

FASTQ is another text-based file format, similar to FASTA but also storing quality scores for each nucleotide in a sequence. Headers are demarcated by the @ character instead of >, and although not required to be, tend to be formatted strings containing information pertaining to the sequencing device that produced the read. Sequence data is followed by a single + on a new line, before a string of quality scores (encoded as ASCII characters within a certain range, depending on the quality schema used) follows on another new line:

@BEEPBOOP-SEQUENCER:RUN:CELL:LANE:TILE:X:Y 1:FILTERED:CONTROL:INDEX
HELLOIAMASEQUENCEMAKEMEINTOAPROTEINPLEASES
+
!"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJ

This example uses Illumina 1.8+ sequence identifiers and quality scores, the same as those found in the dataset. The quality string represents increasing quality from 0 (worst) to 41 (best) left to right. Taking the first read from the first file as an actual example, we get a better look at a “real-world” sequence header:

@HWI-D00173:21:D22FBACXX:5:1101:1806:1986 1:N:0:CTCTCT
TTGTGTCAAAACCGAACAACATGACAATCTTACTTGCCTGGCCCTCCGTCCTGCACTTCTGGCATGGGGAAACCACACTGGGGGC
+
IIIAEGIIIFIIIEGIFFIIIFIFIIEFIIIIEFIIEFGCDEFFFFABDDCCCCCBBBBBBBBBBBBBB?BBBB@B?BBBBBBB5
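Since Illumina 1.8+ uses the Phred+33 encoding, decoding a quality string is just a matter of subtracting the ASCII offset of `!`; a minimal example:

```python
def decode_phred33(quality_string):
    """Decode an Illumina 1.8+ (Phred+33) quality string into integer
    Phred scores: '!' (ASCII 33) encodes Q0, 'J' (ASCII 74) encodes Q41."""
    return [ord(c) - 33 for c in quality_string]
```

For instance, `decode_phred33("IIIA")` gives `[40, 40, 40, 32]` for the start of the quality line above. Note that `@` decodes to Q31, a detail that becomes important later in this post.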

So how are these files so large6? Given each read record takes four lines (assuming reads short enough to not be spread over multiple lines — which they are not in our case) and each file contains around 195 million reads, we’re looking at 780 million lines. Per file.

The maximum sequence size was 86bp and each base takes one byte to store, as well as corresponding per-base quality information:

86 * 2 * 195000000
> 33540000000B == 33.54GB

Allowing some arbitrary number of bytes for headers and the + characters:

((86 * 2) + 50 + 1) * 195000000
> 43485000000B == 43.49GB

It just adds up. To be exact, both input files span 781,860,356 lines each — meaning around 781MB of storage is used merely for newlines alone! It takes wc 90 seconds just to count the lines; these files aren’t small at all!
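The back-of-envelope arithmetic above can be wrapped in a function, mirroring the same numbers used in the post (50 bytes allowed per header, one byte for the lone +, newlines left out as before):

```python
def estimate_fastq_bytes(n_reads, read_len, header_len=50):
    """Rough FASTQ size estimate: one sequence line and one quality line
    of read_len bytes each, a header line, and a single '+' per record.
    Newlines (~4 more bytes per record) are deliberately ignored, as in
    the estimate above."""
    per_record = (read_len * 2) + header_len + 1
    return n_reads * per_record
```

Plugging in this dataset's numbers, `estimate_fastq_bytes(195_000_000, 86)` reproduces the ~43.49GB figure.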

Quality Control

Although already done (as described by the paper), it’s good to get an idea of how to run and interpret basic quality checks on the input data. I used FASTQC which outputs some nice HTML reports that can be archived somewhere if you are nice and organised.

Top Tip
It’s good to be nice and organised because in writing this blog post I’ve been able to quickly retrieve the FASTQC reports from October and realised I missed a glaring problem as well as a metric that could have saved me from wasting time.

Shit happens. Well, it’s good you’re checking, I’m much less organised.

— Francesco

For an input FASTQ file, FASTQC generates a summary metrics table. I’ve joined the two tables generated for my datasets below.

Measure             Value (_1)                        Value (_2)
Filename            A3limpetMetaCleaned_1.fastq.trim  A3limpetMetaCleaned_2.fastq.trim
File type           Conventional base calls           Conventional base calls
Encoding            Sanger / Illumina 1.9             Sanger / Illumina 1.9
Total Sequences     195465089                         195465089
Filtered Sequences  0                                 0
Sequence length     4-86                              16-86
%GC                 37                                37

Here I managed to miss two things:

  • Both files store the same number of sequences (which is expected as the sequences are paired), something that I apparently forget about shortly…
  • Both files do not contain sequences of uniform length, neither do the non-uniform lengths have the same range, meaning that some pairs will not align correctly as they cannot overlap fully…

FASTQC also generates some nice graphs, of primary interest, per-base sequence quality over the length of a read:

[Per-base quality box plots: A3limpetMetaCleaned_1.fastq.trim (left) and A3limpetMetaCleaned_2.fastq.trim (right)]

Although made small to fit, both box plots clearly demonstrate that average base quality (blue line) lives well within the “green zone” (binning scores of 28+) slowly declining to a low of around Q34. This is a decent result, although not surprising considering quality filtering has already been performed on the dataset to remove poor quality reads! A nice sanity check nonetheless. I should add that it is both normal and expected for average per-base quality to fall over the length of a read (though this can be problematic if the quality falls drastically) by virtue of the unstable chemistry involved in sequencing.

FASTQC can plot a distribution of GC content against a hypothetical normal distribution; this is useful for genomic sequencing where one would expect such a distribution. However a metagenomic sample will (should) contain many species that may have differing distributions of GC content across their individual genomes. FASTQC will often raise a warning about the distribution of GC content for such metagenomic samples given a statistically significant deviation from or violation of the theoretical normal. These can be ignored.

Two other tests also typically attract warnings or errors; K-mer content and sequence duplication levels. These tests attempt to quantify the diversity of the reads at hand, which is great when you are looking for diversity within reads of one genome; raising a flag if perhaps you’ve accidentally sequenced all your barcodes or done too many rounds of PCR and been left with a lot of similar looking fragments. But once again, metagenomics violates expectations and assumptions made by traditional single-species genomics. For example, highly represented species ([and|or] fragments that do well under PCR) will dominate samples and trigger apparently high levels of sequence duplication. Especially so if many species share many similar sequences which is likely in environments undergoing co-evolution.

FASTQC also plots N count (no call), GC ratio and average quality scores across whole reads as well as per-base sequence content (which should be checked for a roughly linear stability) and distribution of sequence lengths (which should be checked to ensure the majority of sequences are a reasonable size). Together a quick look at all these metrics should provide a decent health check before moving forward, but they too should be taken with a pinch of salt as FASTQC tries to help you answer the question “are these probably from one genome?”.

Trimming

Trimming is the process of truncating bases that fall below a particular quality threshold at the start and end of reads. Typically this is done to create “better” read overlaps (as potential low-quality base calls could otherwise prevent overlaps that should exist) which can help improve performance of downstream tools such as assemblers and aligners.
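For illustration, here is a naive end-trimming sketch. This is not the fastq-mcf algorithm, just the general idea: walk in from both ends of the read until a base meets the quality threshold.

```python
def trim_read(seq, quals, threshold=20):
    """Trim low-quality bases (Phred < threshold) from both ends of a
    read. quals is a list of integer Phred scores, one per base."""
    start, end = 0, len(seq)
    while start < end and quals[start] < threshold:
        start += 1
    while end > start and quals[end - 1] < threshold:
        end -= 1
    return seq[start:end], quals[start:end]
```

For example, `trim_read("AACGTT", [2, 30, 30, 30, 30, 2])` drops the poor first and last bases, returning `("ACGT", [30, 30, 30, 30])`; bases below threshold in the middle of a read are left alone.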

Trimming had already been completed by the time I had got hold of the dataset7, but I wanted to perform a quick sanity check and ensure that the two files had been handled correctly8. Blatantly forgetting about and ignoring the FASTQC reports I had generated and checked over, I queried both files with grep:

grep -c '^@' $FILE

196517718   A3limpetMetaCleaned_1.fastq.trim
196795722   A3limpetMetaCleaned_2.fastq.trim

“Oh dear”9, I thought. The number of sequences in both files are not equal. “Somebody bumbled the trimming!”. I hypothesised that low-quality sequences had been removed but the corresponding mate in the other file had not been removed to keep the pairs in sync.

With hindsight, let’s take another look at the valid range of quality score characters for the Illumina 1.8+ format:

!"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJ
^                              ^         ^
0..............................31........41

The record delimiting character, @, is used to encode Q31. For some reason somebody thought it would be a great idea to make this character available for use in quality strings. I’d been counting records as well as any quality line that happened to begin with an encoded score of Q31: the @.
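The robust way to count records is to rely on FASTQ structure (four lines per record, assuming unwrapped reads, as in this dataset) rather than leading @ characters. A small sketch, with a deliberately awkward record whose quality line starts with @:

```python
def count_fastq_records(path):
    """Count FASTQ records by structure (4 lines per record) rather than
    by leading '@' -- '@' encodes Q31 in Phred+33 and can legally start
    a quality line. Assumes reads are not wrapped over multiple lines."""
    with open(path) as fh:
        n_lines = sum(1 for _ in fh)
    if n_lines % 4 != 0:
        raise ValueError("line count not divisible by 4; truncated FASTQ?")
    return n_lines // 4
```

On a two-record file where one quality line happens to begin with @, the grep-style count of lines starting with @ reports three “records” while the structural count correctly reports two.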


And so, I unnecessarily launched myself head first into my first large scale computing problem; given two sets of ~196 million reads which mostly overlap, how can we efficiently find the intersection (and write it to disk)?
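On toy data the intersection itself is trivial with dictionaries keyed by read ID; the real difficulty at ~196 million reads per file is that this in-memory approach stops fitting comfortably in RAM. A sketch with hypothetical data:

```python
def sync_pairs(records_1, records_2):
    """Keep only read IDs present in both ends of the pair, discarding
    reads whose mate was dropped. Fine for toy data; at ~196 million
    reads per file, holding both ID sets in memory is the real problem."""
    common = records_1.keys() & records_2.keys()
    return ({rid: records_1[rid] for rid in common},
            {rid: records_2[rid] for rid in common})
```

For example, if `"b"` survived only in the first file and `"d"` only in the second, both are dropped and the two outputs stay in sync.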


tl;dr

  • In bioinformatics, “small” really isn’t small.
  • Try to actually read quality reports, then read them again. Then grab a coffee or go outside and come back and read them for a third time before you do something foolish and have to tell everyone at your lab meeting about how silly you are.
  • Don’t count the number of sequences in a FASTQ file by counting @ characters at the start of a line, it’s a valid quality score for most quality encoding formats.

  1. Now realised to be a complete misnomer, both in terms of size and effort. 
  2. A text based file format where sequences are delimited by > and a sequence name [and|or] description, followed by any number of lines containing nucleotides or amino acids (or in reality, whatever you fancy):

    >Example Sequence Hoot Factor 9 | 00000001
    HELLOIAMASEQUENCE
    BEEPBOOPTRANSLATE
    MEINTOPROTEINS
    >Example Sequence Hoot Factor 9 | 00000002
    NNNNNNNNNNNNNNNNN

    Typically sequence lines are of uniform length (under 80bp), though this is not a requirement of the format. The NCBI suggest formats for the header (single line descriptor, following the > character) though these are also not required to be syntactically valid. 

  3. Stored as text we take a byte for each of the 3 billion nucleotides as well as each newline delimiter and an arbitrary number of bytes for each chromosome’s single line header. 
  4. Seriously, can we stop calling it nextgen yet? 
  5. I’m unsure why, from a recent internal talk I was under the impression we’d normally trim the first “few” bases (1-3bp, maybe up to 13bp if there’s a lot of poor quality nucleotides) to try and improve downstream analysis such as alignments (given the start and end of reads can often be quite poor and not align as well as they should) but 15bp seems excessive. It also appears the ends of the reads were not truncated.

    Update
    It’s possible an older library prep kit was used to create the sequencing library, thus the barcodes would have needed truncating from the reads along with any poor scoring bases.

    The ends of the reads were not truncated as the quality falls inside a reasonable threshold.
     

  6. Or small, depending on whether you’ve adjusted your world view yet. 
  7. See footnote 5. 

  8. Primarily because I don’t trust anyone. Including myself. 
  9. I’m sure you can imagine what I really said. 