Bioinformatics is a disorganised disaster and I am too. So I made a shell.
https://samnicholls.net/2016/11/16/disorganised-disaster/
Wed, 16 Nov 2016 17:50:59 +0000

If you don’t want to hear me wax lyrical about how disorganised I am, you can skip ahead to where I tell you about how great the pseudo-shell that I made and named chitin is.

Back in 2014, about half way through my undergraduate dissertation (Application of Machine Learning Techniques to Next Generation Sequencing Quality Control), I made an unsettling discovery.

I am disorganised.

The discovery was made after my supervisor asked a few interesting questions regarding some of my earlier discarded analyses. When I returned to the data to try and answer those questions, I found I simply could not regenerate the results. Despite the fact that both the code and each “experiment” were tracked by a git repository and I’d written my programs to output (what I thought to be) reasonable logs, I still could not reproduce my science. It could have been anything: an ad-hoc, temporary tweak to a harness script, a bug fix in the code itself masking a result, or any number of other possible untracked changes to the inputs or program parameters. In general, it was clear that I had failed to collect all pertinent metadata for an experiment.

Whilst it perhaps sounds like I was guilty of negligent book-keeping, it really wasn’t for lack of trying. Yet when dealing with many interesting questions at once, it’s so easy to make ad-hoc changes, or perform undocumented command line based munging of input data, or accidentally run a new experiment that clobbers something. Occasionally, one just forgets to make a note of something, or assumes a change is temporary but for one reason or another, the change becomes permanent without explanation. These subtle pipeline alterations are easily made all the time, and can silently invalidate swathes of results generated before (and/or after) them.

Ultimately, for the purpose of reproducibility, almost everything (copies of inputs, outputs, logs, configurations) was dumped and tar’d for each experiment. But this approach brought problems of its own: just tabulating results was difficult in its own right. In the end, I was pleased with that dissertation, but a small part of me still hurts when I think back to the problem of archiving and analysing those result sets.

It was a nightmare, and I promised it would never happen again.

Except it has.

A relapse of disorganisation

Two years later and I’ve continued to be capable of convincing a committee to allow me to progress towards adding the title of doctor to my bank account. As part of this quest, recently I was inspecting the results of a harness script responsible for generating trivial haplotypes, corresponding reads and attempting to recover them using Gretel. “Very interesting, but what will happen if I change the simulated read size”, I pondered; shortly before making an ad-hoc change to the harness script and inadvertently destroying the integrity of the results I had just finished inspecting by clobbering the input alignment file used as a parameter to Gretel.

Argh, not again.

Why is this hard?

Consider Gretel: she’s not just a simple standalone tool that one can execute to rescue haplotypes from the metagenome. One must first push their raw reads through some form of pipeline (pictured below) to generate the required inputs for the recovery algorithm: an alignment (which essentially gives a co-ordinate system to those reads) and the discovered variants (the positions in that co-ordinate system that relate to polymorphisms on reads).

This is problematic for one who wishes to be aware of the provenance of all outputs of Gretel, as those outputs depend not only on the immediate inputs (the alignment and called variants), but on the entirety of the pipeline that produced them. Thus we must capture as much information as possible regarding all of the steps that occur from the moment the raw reads hit the disk, up to Gretel finishing with extracted haplotypes.

But as I described in my last status report, these tools are themselves non-trivial. bowtie2 has more switches than an average spaceship, and its output depends on its complex set of parameters and inputs (that also have dependencies on previous commands), too.

[Image: the pipeline described above]

bash scripts are all well and good for keeping track of a series of commands that yield the result of an experiment, and one can create a nice new directory in which to place such a result at the end – along with any log files and a copy of the harness script itself for good measure. But what happens when future experiments use different pipeline components, with different parameters, or we alter the generation of log files to make way for other metadata? What’s a good directory naming strategy for archiving results anyway? What if parts (or even all of the) analysis are ad-hoc and we are left to reconstruct the history? How many times have you made a manual edit to a malformed file, or had to look up exactly what combination of sed, awk and grep munging you did that one time?

One would have expected me to have learned my lesson by now, but I think meticulous digital lab book-keeping is just not that easy.

What does organisation even mean anyway?

I think the problem is perhaps exacerbated by conflating the meaning of “organisation”. There are a few somewhat different, but ultimately overlapping problems here:

  • How to keep track of how files are created
    What command created file foo? What were the parameters? When was it executed, by whom?
  • Be aware of the role that each file plays in your pipeline
    What commands go on to use file foo? Is it still needed?
  • Assure the ongoing integrity of past and future results
    Does this alignment have reads? Is that FASTA index up to date?
    Are we about to clobber shared inputs (large BAMS, references) that results depend on?
  • Archiving results in a sensible fashion for future recall and comparison
    How can we make it easy to find and analyse results in future?

Indeed, my previous attempts at organisation address some but not all of these points, which is likely the source of my bad feeling. Keeping hold of bash scripts can help me determine how files are created, and the role those files go on to play in the pipeline; but results are merely dumped in a directory. Such directories are created with good intent, and named something that was likely useful and meaningful at the time. Unfortunately, I find that these directories become less and less useful as archive labels as time goes on… For example, what the fuck is ../5-virus-mix/2016-10-11__ref896__reg2084-5083__sd100/1?

This approach also had no way to assure the current and future integrity of my results. Last month I had an issue with Gretel outputting bizarrely formatted haplotype FASTAs. After chasing my tail trying to find a bug in my FASTA I/O handling, I discovered this was actually caused by an out of date FASTA index (.fai) on the master reference. At some point I’d exchanged one FASTA for another, assuming that the index would be regenerated automatically. It wasn’t. Thus the integrity of experiments using that combination of FASTA+index was damaged. Additionally, the integrity of the results generated using the old FASTA were now also damaged: I’d clobbered the old master input.
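This failure mode (an index silently going stale when its FASTA is swapped out) is mechanically detectable. As a minimal sketch, assuming the usual samtools convention that the index lives alongside the FASTA with a .fai suffix:

```python
import os

def fasta_index_is_stale(fasta_path, index_suffix=".fai"):
    """Return True if the FASTA's index is missing or older than the FASTA."""
    index_path = fasta_path + index_suffix
    if not os.path.exists(index_path):
        return True  # no index has been generated at all
    # an index written before the FASTA was last modified is suspect
    return os.path.getmtime(index_path) < os.path.getmtime(fasta_path)
```

A pipeline could run a check like this before every command that consumes the FASTA, and refuse to proceed (or regenerate the index) when it returns True, rather than letting the stale index quietly poison downstream results.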

There is a clear need to keep better metadata for files, executed commands and results, beyond just tracking everything with git. We need a better way to document the changes a command makes in the file system, and a mechanism to better assure integrity. Finally we need a method to archive experimental results in a more friendly way than a time-sensitive graveyard of timestamps, acronyms and abbreviations.

So I’ve taken it upon myself to get distracted from my PhD to embark on a new adventure to save myself from ruining my PhD[2], and fix bioinformatics for everyone.

Approaches for automated command collection

Taking the number of post-its attached to my computer and my sporadically used notebooks as evidence enough to outright skip over the suggestion of a paper based solution to these problems, I see two schools of thought for capturing commands and metadata computationally:

  • Intrusive, but data is structured with perfect recall
    A method whereby users must execute commands via some sort of wrapper. All commands must have some form of template that describes inputs, parameters and outputs. The wrapper then “fills in” the options and dispatches the command on the user’s behalf. All captured metadata has uniform structure and nicely avoids the need to attempt to parse user input. Command reconstruction is perfect but usage is arguably clunky.
  • Unobtrusive, best-effort data collection
    A daemon-like tool that attempts to collect executed commands from the user’s shell and monitor directories for file activity. Parsing command parameters and inputs is done in a naive best-effort scenario. The context of parsed commands and parameters is unknown; we don’t know what a particular command does, and cannot immediately discern between inputs, outputs, flags and arguments. But, despite the lack of structured data, the user does not notice our presence.

There is a trade-off between usability and data quality here. If we sit between a user and all of their commands, offering a uniform interface to execute any piece of software, we can obtain perfectly structured information and are explicitly aware of parameter selections and the paths of all inputs and desired outputs. We know exactly where to monitor for file system changes, and can offer user interfaces that not only merely enumerate command executions, but offer searching and filtering capabilities based on captured parameters: “Show me assemblies that used a k-mer size of 31”.
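To make that concrete, here is a toy sketch of the kind of query such structured capture would enable. Everything below (the record layout, tool names and parameter keys) is invented for illustration; it is not how any real wrapper stores its data:

```python
# Hypothetical records as a command wrapper might capture them: each
# execution is stored with its tool name and a fully parsed parameter dict.
runs = [
    {"tool": "velvet", "params": {"k": 31}, "outputs": ["asm1/contigs.fa"]},
    {"tool": "velvet", "params": {"k": 55}, "outputs": ["asm2/contigs.fa"]},
    {"tool": "bowtie2", "params": {"threads": 16}, "outputs": ["out.sam"]},
]

def find_runs(records, tool=None, **param_filters):
    """Return records matching a tool name and any parameter constraints."""
    hits = []
    for rec in records:
        if tool is not None and rec["tool"] != tool:
            continue
        if all(rec["params"].get(k) == v for k, v in param_filters.items()):
            hits.append(rec)
    return hits

# "Show me assemblies that used a k-mer size of 31"
assemblies_k31 = find_runs(runs, tool="velvet", k=31)
```

This sort of filtering is trivial once parameters are structured, and essentially impossible if all you kept was a flat shell history.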

But we must ask ourselves, how much is that fine-grained data worth to us? Is exchanging our ability to execute commands ourselves worth the perfectly structured data we can get via the wrapper? How much of those parameters are actually useful? Will I ever need to find all my bowtie2 alignments that used 16 threads? There are other concerns here too: templates that define a job specification must be maintained. Someone must be responsible for adding new (or removing old) parameters to these templates when tools are updated. What if somebody happens to misconfigure such a template? More advanced users may be frustrated at being unable to merely execute their job on the command line. Less advanced users could be upset that they can’t just copy and paste commands from the manual or biostars. What about smaller jobs? Must one really define a command template to run trivial tools like awk, sed, tail, or samtools sort through the wrapper?

It turns out I know the answer to this already: the trade-off is not worth it.

Intrusive wrappers don’t work: a sidenote on sunblock

Without wanting to bloat this post unnecessarily, I want to briefly discuss a tool I’ve written previously, but first I must set the scene[3].

Within weeks of starting my PhD, I made a computational enemy in the form of Sun Grid Engine: the scheduler software responsible for queuing, dispatching, executing and reporting on jobs submitted to the institute’s cluster. I rapidly became frustrated with having an unorganised collection of job scripts, with ad-hoc edits that meant I could no longer re-run a job previously executed with the same submission script (does this problem sound familiar?). In particular, I was upset with the state of the tools provided by SGE for reporting on the status of jobs.

To cheer myself up, I authored a tool called sunblock, with the goal of never having to look at any component of Sun Grid Engine directly ever again. I was successful in my endeavour and to this day continue to use the tool on the occasion where I need to use the cluster.

[Screenshot: sunblock]

However, as hypothesised above, sunblock does indeed require an explicit description of an interface for any job that one would wish to submit to the cluster, and it does prevent users from just pasting commands into their terminal. This all-encompassing wrapping feature, which allows us to capture the best structured information on every job, is also the tool’s complete downfall. Despite the useful information that could be extracted using sunblock (there is even a shiny sunblock web interface), its ability to automatically re-run jobs, and its superior reporting on job progress compared to SGE alone, it was still not enough to get user traction in our institute.

For the same reason that I think more in-the-know bioinformaticians don’t want to use Galaxy, sunblock failed: because it gets in the way.

Introducing chitin: an awful shell for awful bioinformaticians

Taking what I learned from my experimentation with sunblock on-board, I elected to take the less intrusive, best-effort route to collecting user commands and file system changes. Thus I introduce chitin: a Python based tool that (somewhat)-unobtrusively wraps your system shell, to keep track of commands and file manipulations to address the problem of not knowing how any of the files in your ridiculously complicated bioinformatics pipeline came to be.

I initially began the project with a view to creating a digital lab book manager. I envisaged offering a command line tool with several subcommands, one of which could take a command for execution. However, as soon as I tried out my prototype and found myself prepending the majority of my commands with lab execute, I wondered whether I could do better. What if I just wrapped the system shell and captured all entered commands? This might seem a rather dumb and roundabout way of getting one’s command history, but consider this: if we wrap the system shell as a means to capture all the input, we are also in a position to capture the output for clever things, too. Imagine a shell that could parse the stdout for useful metadata to tag files with…
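The essence of the idea, sitting between the user and the system shell so that both the command and its output pass through you, can be sketched in a few lines. This is a toy illustration of the technique, not chitin's actual implementation:

```python
import subprocess

def run_and_capture(command, history):
    """Execute a shell command, record it alongside its output, return stdout."""
    result = subprocess.run(command, shell=True, capture_output=True, text=True)
    history.append({
        "cmd": command,
        "returncode": result.returncode,
        "stdout": result.stdout,
        "stderr": result.stderr,
    })
    return result.stdout

history = []
out = run_and_capture("echo 42 reads processed", history)
```

A real shell needs a read-eval loop, job control, pipes and so on; the point is only that once every command funnels through one function, its stdout and stderr are available to metadata handlers for free.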

I liked what I was imagining, and so, despite my best efforts to get even just one person to convince me otherwise, I wrote my own pseudo-shell.

chitin is already able to track executed commands that yield changes to the file system. For each file in the chitin tree, there is a full modification history. Better yet, you can ask what series of commands need to be executed in order to recreate a particular file in your workflow. It’s also possible to tag files with potentially useful metadata, and so chitin takes advantage of this by adding the runtime[4] and current user to all executed commands for you.

Additionally, I’ve tried to find my own middle ground between the sunblock-esque configurations that yielded superior metadata, and not getting in the way of our users too much. So one may optionally specify handlers that can be applied to detected commands, and captured stdout/stderr. For example, thanks to my bowtie2 configuration, chitin tags my out.sam files with the overall alignment rate (and a few targeted parameters of interest), automatically.
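For illustration, a stderr handler of this kind can be little more than a regular expression; bowtie2 prints its summary (including a line like “92.50% overall alignment rate”) to stderr. The handler below is a sketch of the parsing idea, not chitin's configuration format:

```python
import re

# bowtie2 ends its stderr summary with "<rate>% overall alignment rate";
# a handler only needs a regex to lift that figure out as metadata.
ALIGNMENT_RATE = re.compile(r"([\d.]+)% overall alignment rate")

def bowtie2_handler(stderr_text):
    """Return metadata tags extracted from captured bowtie2 stderr, if any."""
    match = ALIGNMENT_RATE.search(stderr_text)
    if match is None:
        return {}
    return {"overall_alignment_rate": float(match.group(1))}

example_stderr = (
    "10000 reads; of these:\n"
    "  10000 (100.00%) were paired; of these:\n"
    "92.50% overall alignment rate\n"
)
tags = bowtie2_handler(example_stderr)
```

The shell, having captured stderr anyway, can then attach the returned tags to the command's output files.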

[Screenshot: chitin tagging an out.sam file with bowtie2's overall alignment rate]

chitin also allows you to specify handlers for particular file formats to be applied to files as they are encountered. My environment, for example, is set up to count the number of reads inside a BAM, and associate that metadata with that version of the file:

[Screenshot: chitin associating a read count with a version of a BAM]

In this vein, we are in a nice position to check on the status of files before and after a command is executed. To address some of my integrity woes, chitin allows you to define integrity handlers for particular file formats too. Thus my environment warns me if a BAM has 0 reads, is missing an index, or has an index older than itself. Similarly, an empty VCF raises a warning, as does an out of date FASTA index. Coming shortly will be additional checks for whether you are about to clobber a file that is depended on by other files in your workflow. Kinda cool, even if I do say so myself.
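The dispatch machinery for such checks can be very small: a mapping from file extension to a checking function. The registry and check below are illustrative only (a real BAM check would also shell out to samtools to count reads and catch the 0-read case):

```python
import os

def check_bam(path):
    """Integrity warnings for a BAM: here, only index presence and freshness."""
    warnings = []
    index = path + ".bai"
    if not os.path.exists(index):
        warnings.append("missing index")
    elif os.path.getmtime(index) < os.path.getmtime(path):
        warnings.append("index older than BAM")
    return warnings

# map file extensions to their integrity handlers
INTEGRITY_HANDLERS = {".bam": check_bam}

def check_integrity(path):
    """Dispatch a path to the handler registered for its extension, if any."""
    _, ext = os.path.splitext(path)
    handler = INTEGRITY_HANDLERS.get(ext)
    return handler(path) if handler else []
```

Running `check_integrity` over a command's inputs before execution, and its outputs afterwards, is enough to surface exactly the stale-index class of bug described earlier, at the moment it is introduced rather than weeks later.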

Conclusion

Perhaps I’m trying to solve a problem of my own creation. Yet from a few conversations I’ve had with folks in my lab, and frankly, anyone I could get to listen to me for five minutes about managing bioinformatics pipelines, there seems to be sympathy to my cause. I’m not entirely convinced myself that a “shell” is the correct solution here, but it does seem to place us in the best position to get commands entered by the user, with the added bonus of getting stdout to parse for free. Though, judging by the flurry of Twitter activity on my dramatically posted chitin screenshots lately, I suspect I am not so alone in my disorganisation and there are at least a handful of bioinformaticians out there who think a shell isn’t the most terrible solution to this either. Perhaps I just need to be more of a wet-lab biologist.

Either way, I genuinely think there’s a lot of room to do cool stuff here, and to my surprise, I’m genuinely finding chitin quite useful already. If you’d like to try it out, the source for chitin is open and free on GitHub. Please don’t expect too much in the way of stability, though.


tl;dr

  • A definition of “being organised” for science and experimentation is hard to pin down
  • But independent of such a definition, I am terminally disorganised
  • Seriously what the fuck is ../5-virus-mix/2016-10-11__ref896__reg2084-5083__sd100[1]
  • I think command wrappers and platforms like Galaxy get in the way of things too much
  • I wrote a “shell” to try and compensate for this
  • Now I have a shell, it is called chitin

  1. This is a genuine directory in my file system, created about a month ago. It contains results for a run of Gretel against the pol gene on the HIV genome (2084-5083). Off the top of my head, I cannot recall what sd100 is, or why reg appears before the base positions. I honestly tried. 
  2. Because more things that are not my actual PhD is just what my PhD needs. 
  3. If it helps you, imagine some soft jazz playing to the sound of rain while I talk about this gruffly in the dark with a cigarette poking out of my mouth. Oh, and everything is in black and white. It’s bioinformatique noir
  4. I’m quite pleased with this one, because I pretty much always forget to time how long my assemblies and alignments take. 
The Tolls of Bridge Building: Part IV, Mysterious Malformations
https://samnicholls.net/2015/08/27/bridgebuilding-p4/
Thu, 27 Aug 2015 11:00:38 +0000

Following a short hiatus on the sample un-improvement job (which may or may not have been halted by vr-pipe inadvertently knocking over a storage node at the Sanger Institute), our 837 non-33 jobs burst back into life, only to fall at the final hurdle of the first pipeline of the vr-pipe workflow. Despite my lack of deerstalker and pipe, it was time to play bioinformatics Sherlock Holmes.

Without mercury access to review reams of logs myself, I had to rely on the vr-pipe web interface, which provides a single sample of output from a random job in the state of interest. Taking its offering, it seemed I was about to get extremely familiar with lanelet 7293_3#8. The error was at least straightforward…

Accounting Irregularities[1]

... [step] failed because 48619252 reads were generated in the output bam
file, yet there were 48904972 reads in the original bam file at [...]
/VRPipe/Steps/bam_realignment_around_known_indels.pm line 167

The first thing to note is that the error is raised in vr-pipe itself, not in the software used to perform the step in question — which happened to be GATK, for those interested. vr-pipe is open source software, hosted on Github. The deployed release of vr-pipe used by the team is a fork and so the source raising the error is available to anyone with the stomach to read Perl:

my $expected_reads = $in_file->metadata->{reads} || $in_file->num_records;
my $actual_reads = $out_file->num_records;

if ($actual_reads == $expected_reads) {
    return 1;
}
else {
    $out_file->unlink;
    $self->throw("cmd [$cmd_line] failed because $actual_reads reads were generated in the output bam file, yet there were $expected_reads reads in the original bam file");
}

The code is simple: vr-pipe just enforces a check that all the reads from the input file make it through to the output file. Seems sensible. This is very much desired behaviour, as we never drop reads from BAM files; there are enough flags and scores in the SAM spec to put reads aside for one reason or another without actually removing them. So we know that the job proper completed successfully (there was an output file, sans errors from GATK itself). The question now is: where did those reads go?

Though before launching a full-scale investigation, my first instinct was to question the error itself and check the numbers stated in the message were even correct. It was easy to confirm the original number of reads (48,904,972) by checking the vr-pipe metadata I had generated to submit the bridged BAMs to vr-pipe; I ran samtools view -c on the bridged BAM again to be sure.

But vr-pipe’s standard over-reaction to job failure is nuking all the output files, so I’d need to run the step myself manually on lanelet 7293_3#8. A few hours later, samtools view -c and samtools stats confirmed the output file really did contain 48,619,252 reads, a shortfall of 285,720.

I asked Irina, one of our vr-pipe sorceresses with mercury access, to dig out the whole log for our spotlight lanelet. Two stub INFO lines sitting right at the tail of the GATK output shed immediate light on the situation…

Expunged Entries

INFO 18:31:28,970 MicroScheduler – 285720 reads were filtered out during the traversal out of approximately 48904972 total reads (0.58%)

INFO 18:31:28,971 MicroScheduler – -> 285720 reads (0.58% of total) failing MalformedReadFilter

Every single one of the 285,720 “missing” reads can be accounted for by this MalformedReadFilter. This definitely isn’t expected behaviour; as I said already, we have ways of marking reads as QC failed, supplementary or unmapped without just discarding them wholesale. Our pipeline is not supposed to drop data. The MalformedReadFilter documentation on the GATK website states:

Note that the MalformedRead filter itself does not need to be specified in the command line because it is set automatically.

This at least explains the “unexpected” nature of the filter, but surely vr-pipe encounters files with bad reads that need filtering all the time? My project can’t be the first… I figured I must have missed a pre-processing step, I asked around: “Was I supposed to do my own filtering?”. I compared the @PG lines documenting programs applied to the 33 to see whether they had undergone different treatment, but I couldn’t see anything related to filtering, quality or otherwise.

I escalated the problem to Martin who replied with a spy codephrase:

the MalformedReadFilter is a red herring

Both Martin and Irina had seen similar errors that were usually indicative of the wrong version of vr-pipe [and|or] samtools — causing a subset of flagged reads to be counted rather than all. But I explained the versions were correct and I’d manually confirmed the reads as actually missing by running the command myself, we were all a little stumped.

I read the manual for the filter once more and realised we’d ignored the gravity of the word “malformed”:

This filter is applied automatically by all GATK tools in order to protect them from crashing on reads that are grossly malformed.

We’re not talking about bad quality reads, we’re talking about reads that are incorrect in such a way that it may cause an analysis to terminate early. My heart sank; I had a hunch. I ran the output from lanelet 7293_3#8’s bridgebuilder adventure through GATK PrintReads, an arbitrary tool that I knew also applied the MalformedReadFilter. Those very same INFO lines were printed.

My hunch was right, the reads had been malformed all along.

Botched Bugfix

I had a pretty good idea as to what had happened, but I ran the inputs to brunel (the final step of bridgebuilder) through PrintReads as a sanity check. This proved a little more difficult to orchestrate than one might have expected: I had to falsify headers and add those pesky missing RG tags that plagued us before.

The inputs were fine, as suspected, brunel was the malformer. My hunch? My quick hack to fix my last brunel problem had come back to bite me in the ass and caused an even more subtle brunel problem.

Indeed, despite stringently checking the bridged BAMs with five different tools, successfully generating an index and even processing the file with Picard to mark duplicate reads and GATK to detect and re-align around indels, these malformed reads still flew under the radar — only to be caught by a few lines of Perl that a minor patch to vr-pipe happened to put in the way.

Recall that my brunel fix initialises the translation array with -1:

Initialise trans[i] = -1
The only quick-fix grade solution that works, causes any read on a TID that has no translation to be regarded as “unmapped”. Its TID will be set to “*” and the read is placed at the end of the result file. The output file is however, valid and indexable.

This avoided an awful bug where brunel would assign reads to chromosome 1 if their actual chromosome did not have a translation listed in the user-input translation file. In practice, the fix worked. Reads appeared “unmapped”, their TID was an asterisk and they were listed at the end of the file. The output was viewable, indexable and usable with Picard and GATK, but technically not valid afterall.

To explain why, let’s feed everybody’s favourite bridged BAM 7293_3#8 to Picard ValidateSamFile, a handy tool that does what it says on the tin. ValidateSamFile linearly passes over each record of a BAM and prints any associated validation warnings or errors to a log file[2]. As the tool moved over the target, an occasional warning that could safely be ignored was written to my terminal[3]. The file seemed valid and, as the progress bar indicated we were nearly out of chromosomes, I braced.

As ValidateSamFile attempted to validate the unmapped reads, a firehose of errors (I was expecting a lawn sprinkler) spewed up the terminal, and despite my best efforts I couldn’t Ctrl-C in time to point it away from my face.

False Friends

There was a clear pattern to the log file. Each read that I had translated to unmapped triggered four errors. Taking just one of thousands of such quads:

ERROR: […] Mate Alignment start should be 0 because reference name = *.
ERROR: […] Mapped mate should have mate reference name
ERROR: […] Mapped read should have valid reference name
ERROR: […] Alignment start should be 0 because reference name =*.

Now, a bitfield of flags on each read is just one of the methods employed by the BAM format to perform validation and filtering operations quickly and efficiently. One of these bits, 0x4, indicates a read is unmapped. Herein lies the problem: although I had instructed brunel to translate the TID (which refers to the i-th @SQ line in the header) to -1 (i.e. no sequence), I did not set the unmapped flag. This is invalid (Mapped read should have valid reference name), as the read will appear as aligned, but to an unknown reference sequence.

Sigh. A quick hack combined with my ignorance of the underlying BAM format specification was at fault[4].

The fix would be frustratingly easy: a one-liner in brunel to raise the UNMAPPED flag (and, to be good, a one-liner to unset the PROPER_PAIR flag[5]) for the appropriate reads. Of course, expecting an easy fix jinxed the situation and the MalformedReadFilter cull persisted, despite my new semaphore knowledge.
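In terms of the SAM flag bits themselves, the one-liner amounts to simple bit arithmetic. Sketched here in Python rather than in brunel itself; the bit values 0x2 and 0x4 come from the SAM specification:

```python
# SAM flag bits, as defined in the SAM specification
FLAG_PROPER_PAIR = 0x2  # read mapped in a proper pair
FLAG_UNMAPPED = 0x4     # the read itself is unmapped

def mark_unmapped(flag):
    """Raise the unmapped bit and clear the proper-pair bit on a SAM flag."""
    flag |= FLAG_UNMAPPED       # an untranslatable read is now unmapped
    flag &= ~FLAG_PROPER_PAIR   # an unmapped read cannot be properly paired
    return flag
```

For example, a flag of 99 (paired, proper pair, mate reverse, first in pair) becomes 101: bit 0x4 is raised and bit 0x2 is cleared, leaving the other bits untouched.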

For each offending read, ValidateSamFile produced a triplet of errors:

ERROR: […] Mate Alignment start should be 0 because reference name = *.
ERROR: […] MAPQ should be 0 for unmapped read.
ERROR: […] Alignment start should be 0 because reference name =*.

The errors at least seem to indicate that I’d set the UNMAPPED flag correctly. Confusingly, the format spec has the following point (parentheses and emphasis mine):

Bit 0x4 (unmapped) is the only reliable place to tell whether the read is unmapped. If 0x4 is set, no assumptions can be made about RNAME, POS, CIGAR, MAPQ, and bits 0x2 (properly aligned), 0x100 (secondary), and 0x800 (supplemental).

It would seem that canonically, the alignment start (POS) and mapping quality (MAPQ) are untrustworthy on reads where the unmapped flag is set. Yet this triplet appears exactly the same number of times as the number of reads hard filtered by the MalformedReadFilter.

I don’t understand how these reads could be regarded as “grossly malformed” if even the format specification acknowledges the possibility of these fields containing misleading information. Invalid yes, but grossly malformed? No. I just have to assume there’s a reason GATK is being especially anal about such reads, perhaps developers simply (and rather fairly) don’t want to deal with the scenario of not knowing what to do with reads where the values of half the fields can’t be trusted[6]. I updated brunel to set the alignment positions for the read and its mate alignment to position 0 in retaliation.

I’d reduced the ValidateSamFile error quads to triplets, and now the triplets to duos:

ERROR: […] Mate Alignment start should be 0 because reference name = *.
ERROR: […] Alignment start should be 0 because reference name =*.

The alignment positions are non-zero? But I just said I’d fixed that? What gives?

I used samtools view and grabbed the tail of the BAM, those positions were indeed non-zero. I giggled at the off-by-one error, immediately knowing what I’d done wrong.

The BAM spec describes the alignment position field as:

POS: 1-based leftmost mapping POSition of the first matching base.

But under the hood, the pos field is defined in htslib as a 0-based co-ordinate, because that’s how computers work. The 0-based indices are then converted by just adding 1 whenever necessary. Thus to obtain a value of 0, I’d need to set the positions to -1. This off-by-one mismatch is a constant source of embarrassing mistakes in bioinformatics, where typically 1-based genomic indices must be reconciled with 0-indexed data structures[7].
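The conversion in question is trivial, which is exactly why it is so easy to get wrong. A sketch of the two directions (the function names are mine, not htslib's):

```python
def htslib_pos_to_sam(pos):
    """htslib stores positions 0-based; the SAM text POS field is 1-based.
    The internal sentinel -1 therefore renders as POS 0 (unmapped)."""
    return pos + 1

def sam_pos_to_htslib(pos):
    """Inverse: a SAM POS of 0 maps back to the internal sentinel -1."""
    return pos - 1
```

So to make ValidateSamFile see an alignment start of 0, the in-memory position must be set to -1, not 0.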

Updating brunel once more, I ran the orchestrating Makefile to generate what I really hope to be the final set of bridged BAMs, ran them through Martin’s addreplacerg subcommand to fill in those missing RG tags, and then put each up against the now-six final check tools (yes, they validated with ValidateSamFile). I checked index generation, re-ran a handful of sanity checks (mainly ensuring we hadn’t lost any reads) and re-generated the manifest file, before finally asking Irina, for what I really hope to be the last time, to reset my vr-pipe setup.

I should add, Martin suggested that I use samtools fixmate, a subcommand designed for this very purpose. The problem is, for fixmate to know where the mates are, they must be adjacent to each other; that is, sorted by name and not by co-ordinate. It was thus cheaper in both time and computation to just re-run brunel than to name-sort, fixmate and coord-sort the current bridged BAMs.

Victorious vr-pipe Update: Hours later

Refreshing the vr-pipe interface intermittently, I was waiting to see a number of inputs larger than 33 make it to the end of the first pipeline of the workflow: represented by a friendly looking green progress bar.

Although the number currently stands at 1, I remembered that all of the 33 had the same leading digit and that for each progress bar, vr-pipe will offer up metadata on one sample job. I hopefully clicked the green progress bar and inspected the metadata, the input lanelet was not in the set of the 33 already re-mapped lanelets.

I’d done it. After almost a year, I’d led my lanelet Lemmings to the end of the first level.


tl;dr

  • GATK has an anal, automated and aggressive MalformedReadFilter
  • Picard ValidateSamFile is a useful sanity check for SAM and BAM files
  • Your cleverly simple bug fix has probably swapped a critical and obvious problem for a critically unnoticeable one
  • Software is not obliged to adhere to any or all of a format specification
  • Off by one errors continue to produce laughably simple mistakes
  • I should probably learn the BAM format spec inside out
  • Perl is still pretty grim

  1. Those reads were just resting in my account! ↩
  2. Had I known of the tool sooner, I would have employed it as part of the extensive bridgebuilder quality control suite. ↩
  3. Interestingly, despite causing jobs to terminate, a read missing an RG tag is a warning, not an error. ↩
  4. Which only goes to further my don’t trust anyone, even yourself mantra. ↩
  5. Although not strictly necessary as per the specification (see below), I figured it was cleaner. ↩
  6. Though, looking at the documentation, I’m not even sure what aspect of the read is triggering the hard filter. The three additional command line options described don’t seem to be related to any of the errors raised by ValidateSamFile and there is no explicit description of what is considered to be “malformed”. ↩
  7. When I originally authored Goldilocks, I tried to be clever and make things easier for myself, electing to use a 1-based index strategy throughout. This was partially inspired by FORTRAN, which features 1-indexed arrays. In the end, the strategy caused more problems than it solved and I had to tuck my tail between my legs and return to a 0-based index. ↩
The Tolls of Bridge Building: Part III, Sample (Un)Improvement https://samnicholls.net/2015/07/31/bridgebuilding-p3/ https://samnicholls.net/2015/07/31/bridgebuilding-p3/#comments Fri, 31 Jul 2015 11:00:53 +0000 http://blog.ironowl.io/?p=165 Previously, on Samposium: I finally had the 870 lanelets required for the sample improvement process. But in this post, I explain how my deep-seated paranoia about the quality of my data just wasn’t enough to prevent what happened next.

I submitted my 870 bridged BAMs to vr-pipe, happy to essentially be rid of having to deal with the data for a while. vr-pipe is a complex bioinformatics pipeline that in fact consists of a series of pipelines, each of which performs some task or another, with many steps required for each. The end result is “improved sample” BAMs, though perhaps due to the nature of our inclusion of failed lanelets we should title them “unimproved sample” BAMs. Having just defeated the supposedly automated bridging process, I was pretty happy not to be doing this stuff manually and could finally get on with something else for a while… Or so I thought.

A Quiet Weekend

It was Friday and I was determined to leave on-time to see my family in Swansea at a reasonable hour1. After knocking up a quick Python script to automatically generate the metadata file vr-pipe requires, describing the lanelets to be submitted and what sample they should be included in the “improvement” of, I turfed the job over to Micheal who has fabled mercury access2 and could set-up vr-pipe for me.

Being the sad soul that I am, I occasionally checked in on my job over the weekend via the vr-pipe web interface, only to be confused by the apparent lack of progress. The pipeline appeared to just be juggling jobs around various states of pending. But without mercury access to inspect more, I was left to merely enjoy my weekend.

Between trouble on the server farm and the fact that my job is not particularly high priority, the team suggested I be more patient and so I gave it a few more days before pestering Josh to take a look.

Pausing for Permissions

As suspected, something had indeed gone wrong. Instead of telling anybody, vr-pipe sat on a small mountain of errors, hoping the problem would just go away. I’ve come to understand this is expected behaviour. Delving deep in to the logs revealed the simple problem: vr-pipe did not have sufficient write permissions to the directory I had provided the lanelet files in, because I didn’t provide group-write access to it.

One chmod 775 . later and the pipeline burst into life, albeit very briefly, before painting the vr-pipe web interface bright red. Evidently, the problem was more serious.

Sorting Names and Numbers

The first proper step for vr-pipe is creating an index of each input file with samtools index. Probably the most important thing to note for an index file, is that to create one, your file must be correctly sorted. Micheal checked the logs for me and found that the indexing job had failed on all but 33 (shocker) of the input files, complaining that they were all unsorted.

But how could this be? brunel requires sorted input to work, our orchestrating Makefile takes care of sorting files as and when needed with samtools sort. There must be some mistake!

I manually invoked samtools index on a handful of my previous bridged BAMs and indeed, they are unsorted. I traced back through the various intermediate files to see when the sort was damaged before finally referring to the Makefile. My heart sank:

[...]
# sort by queryname
%.queryname_sort.bam: LSF_MEM=10000
%.queryname_sort.bam: LSF_CPU=4
%.queryname_sort.bam: %.bam
	${SAMTOOLS_BIN} sort -@ ${LSF_CPU} -n -T ${TMP_DIR}$(word 1,$+) -O bam -o $@ $<

# sort by coordinate position
%.coordinate_sort.bam: LSF_MEM=10000
%.coordinate_sort.bam: LSF_CPU=4
%.coordinate_sort.bam: %.bam
	${SAMTOOLS_BIN} sort -@ ${LSF_CPU} -n -T ${TMP_DIR}$(word 1,$+) -O bam -o $@ $<
[...]

samtools sort accepts an -n flag, to sort by query name, rather than the default co-ordinate position (i.e. chromosome and base position on the reference sequence). Yet somehow the Makefile had been configured to use query name sorting for both. I knew I was the last one to edit this part of the file, as I’d altered the -T argument to prevent the temporary file clobbering discovered in the last episode.

Had I been lazy and naughty and copied the line from one stanza to the next? I was sure I hadn’t. But the knowing grin Josh shot at me when I showed him the file had me determined to try and prove my innocence. Although, it should first be noted that a spot in a special part of hell must be reserved for the both of us, as the Makefile was not under version control4.

I’d squirrelled away many of the original files from 2014, including the intermediates, and selecting any co-ordinate sorted file yielded a set of query name sorted reads. My name was cleared! Of course, whilst this was apparently Josh’s mistake, it’s not entirely fair to point the finger, given I never noticed the bug despite spending more time nosing around the Makefile than anyone else. As mentioned, I’d even obliviously edited right next to the extraneous -n in question.

But I am compelled to point the finger elsewhere: brunel requires sorted inputs to work correctly, else it creates files that clearly can’t be used in the sample (un)improvement pipeline! How was this allowed to happen?

Sadly, it’s a matter of design. brunel never anticipated that somebody might attempt to provide incorrect input and just gets on with its job regardless. Frustratingly, had this been a feature of brunel, we’d have caught this problem a year ago on the first run of the pipeline.
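Such a guard would have been cheap to write, too. A sketch of the kind of input validation brunel could have performed, assuming records reduced to (TID, POS) pairs — the function and its name are mine, not brunel’s:

```python
def is_coordinate_sorted(records):
    """Return True if (tid, pos) records are coordinate-sorted: each
    chromosome (TID) appears as one contiguous block, with positions
    non-decreasing within that block."""
    seen_tids = set()
    prev_tid, prev_pos = None, None
    for tid, pos in records:
        if tid != prev_tid:
            if tid in seen_tids:      # chromosome appears non-continuously
                return False
            seen_tids.add(tid)
            prev_tid, prev_pos = tid, pos
        elif pos < prev_pos:          # co-ordinates out of order within a TID
            return False
        else:
            prev_pos = pos
    return True

assert is_coordinate_sorted([(1, 10000), (1, 10050), (2, 5)])
assert not is_coordinate_sorted([(1, 10050), (1, 10000)])      # query-name sorted input
assert not is_coordinate_sorted([(1, 10000), (2, 5), (1, 1)])  # non-continuous TID
```

The second failing case is exactly what samtools index later complained about: a read on a chromosome turning up long after that chromosome’s block had apparently ended.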

The co-ordinate sorting step does at least immediately precede invocation of brunel, which is relatively fast. So after correcting the Makefile, nuking the incorrectly sorted BAMs and restarting the pipeline, it wasn’t a long wait before I was ready to hassle somebody with mercury access to push the button.

Untranslated Translations

Before blindly resubmitting everything, I figured I’d save some time and face by adding samtools index to the rest of my checking procedures, to be sure that this first indexing step would at least work on vr-pipe.

The indexing still failed. The final lanelet bridged BAMs were still unsorted.

Despite feeding our now correctly co-ordinate sorted inputs to brunel, we still get an incorrectly sorted output — clearly a faux pas with brunel‘s handling of the inputs. Taking just one of the 837 failed lanelets (all but the already mapped 33 lanelets failed to index) under my wing, I ran brunel manually to try and diagnose the problem.

The error was a little different from the last run: whereas before samtools index complained about co-ordinates appearing out of order, this time chromosomes appeared non-continuously. Unfortunately, several samtools errors still do not give specific information and this is one of them. I knew that somewhere, a read on some chromosome appeared where it shouldn’t, but with each of these bridged BAMs containing tens of millions of records, finding it manually could be almost impossible.

I manually inspected the first 20 or so records in the bridged BAM, all the reads were on TID 1, as expected. The co-ordinates were indeed sorted, starting with the first read at position 10,000.

10,000? Seems a little high? On a hunch I grepped the bridged BAM to try and find a record on TID 1 starting at position 0:

grep -Pn -m1 "^[A-z0-9:#]*\t[0-9]*\t1\t1\t" <(samtools view 7500_7#8.GRCh37-hs37d5_bb.bam)

I got a hit. Read HS29_07500:7:1206:5383:182635#8 appears on TID 1, position 1. Its line number in the bridged BAM? 64,070,629. There’s our non-continuous chromosome.

I took the read name and checked the three sorted inputs to brunel. It appears in the “unchanged” BAM. You may recall these “unchanged” reads are those that binnie deemed as not requiring re-mapping to the new reference and can effectively “stay put”. The interesting part of the hit? In the unchanged BAM, it doesn’t appear on chromosome 1 at all but on “HSCHRUN_RANDOM_CTG19”, presumably some form of decoy sequence.

This appears to be a serious bug in brunel. This “HSCHRUN_RANDOM_CTG19” sequence (and presumably others) seem to be leaving brunel as aligned to chromosome 1 in the new reference. Adding some weight to the theory, the unchanged BAM is the only input that goes through translation too.

Let’s revisit build_translation_file. Recall that the i-th SQ line of the input BAM — the “unchanged” BAM — is mapped to the j-th entry of the user-provided translation table text file. The translation itself is recorded with trans[i] = j, where both i and j rely on two loops meeting particular exit conditions.

But note the while loop:

int counter = file_entries;  // number of SQ lines in the unchanged BAM
[...]
// consume translation entries until the file (or counter) is exhausted
while (!feof(trans_file) && !ferror(trans_file) && counter > 0) {
    getline(&linepointer, &read, trans_file);
    [...]
    trans[i] = j;  // map input SQ line i to output SQ line j
    counter--;
}
[...]

This while loop, which houses the i and j loops as well as the assignment of trans[i] = j, may not iterate once for every SQ line found in the unchanged BAM (file_entries, counted down via counter) if the translation text file contains fewer lines than file_entries. Which, in our case, it does: there are only entries in the translation text file for chromosomes that actually need translating (Chr1 -> 1).

Thus not every element in trans is set with a value by the while loop. And as we’re about to see, the default value those untouched elements fall back on doesn’t work either.

This is particularly troubling, as the check for whether a translation should be performed relies on the default value of trans being NULL. As no default value is set and 0 is the most likely value to turn up in the memory allocated by malloc, the default value of trans[i] for all i is 0. Or in brunel terms: if I can’t find a better translation, translate SQ line i to the 0th line (first chromosome) in the user translation file.

Holy crap. That’s a nasty bug.

As described in the bug report, I toyed with quick fixes based on initialising trans with default values:

  • Initialise trans[i] = i
    This will likely incorrectly label reads as aligning to the i-th SQ line in the new file. If the number of SQ lines in the input is greater than that of the output, this will also likely cause samtools index to error or segfault as it attempts to index a TID that is outside the range of SQ.
  • Initialise trans[i] = NULL
    Prevents translation of TID but actually has the same effect as above.
  • Initialise trans[i] = -1
    The only quick-fix grade solution that works, causes any read on a TID that has no translation to be regarded as “unmapped”. Its TID will be set to “*” and the read is placed at the end of the result file. The output file is however, valid and indexable.
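The failure mode is easy to reproduce outside of C. A Python simulation of the translation table (my own toy model, not brunel code) shows why an implicit default of 0 silently redirects every untranslated sequence to the first chromosome, while a -1 sentinel flags it for unmapping:

```python
UNMAPPED = -1

def build_translation(n_sq_lines, table_entries, default):
    """Toy model of brunel's build_translation_file: trans[i] = j for each
    (i, j) pair in the translation table; entries the loop never reaches
    keep `default`. With C's malloc, that 'default' is whatever happened
    to be in memory -- most likely 0."""
    trans = [default] * n_sq_lines
    for i, j in table_entries:
        trans[i] = j
    return trans

# Three SQ lines in the unchanged BAM, but a table entry only for
# line 0 (Chr1 -> 1); lines 1 and 2 are decoy sequences with no translation.
buggy = build_translation(3, [(0, 0)], default=0)
fixed = build_translation(3, [(0, 0)], default=UNMAPPED)

# Default 0: the untranslated decoys are silently 'translated' to output
# line 0 -- chromosome 1. Exactly the HSCHRUN_RANDOM_CTG19 bug above.
assert buggy == [0, 0, 0]
# Sentinel -1: the decoys are flagged and can be emitted as unmapped ('*').
assert fixed == [0, UNMAPPED, UNMAPPED]
```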

In the end, the question of where these reads should actually end up is a little confusing. Josh seems to think that the new bridged BAM should contain all the old reads on their old decoy sequences if a better place in the new reference could not be found for them. In my opinion, these reads effectively “don’t map” to hs37d5, as they still lie on old decoy sequences not found in the new reference, which is convenient, as my trans[i] = -1 initialisation marks all such reads as unmapped whilst also remaining the simplest fix to the problem.

Either way, the argument as to what should be done with these reads is particularly moot for me and the QC study, because we’re only going to be calling SNPs and focussing on our Goldilocks Region on Chromosome 3, which is neither a decoy region nor Chromosome 1.

Having deployed my own fix, I re-ran our pipeline for what I hoped to be the final time, re-ran the various checks that I’ve picked up along the way (including samtools index) and was finally left with 870 lanelets ready for unimprovement.

Vanishing Read Groups

Or so I thought.

Having convinced Micheal that I wouldn’t demand he assume the role of mercury for me at short notice again, he informed me that whilst all my lanelets had successfully passed the first step of the first pipeline in the entire vr-pipe workflow (indexing), all but (you guessed it) 33 lanelets failed the following step. GATK was upset that some reads were missing an RG tag.

Sigh. Sigh. Sigh. Table flip.

GATK at least gave me a read name to look up with grep and indeed, these reads were missing their RG tag. I traced the reads backward through each intermediate file to see where these tags were lost. I found that these reads had been binned by binnie as requiring re-mapping to the new reference with bwa. The resulting file from bwa was missing an @RG line and thus each read had no RG tag.

Crumbs. I hit the web to see whether this was anticipated behaviour. The short answer was yes, but the longer answer was “you should have used -r to provide bwa with an RG line to use”. Though I throw my hands up in the air a little here to say “I didn’t write the Makefile and I’ve never used bwa“.

Luckily, Martin has recently drafted a pull request to samtools for a new subcommand: addreplacerg. Its purpose? Adding and replacing RG lines and tags in BAM files. More usefully, at least to us, it offers an operation mode3 to tag “orphan” records (reads that have no RG tag — exactly the problem I am facing) with the first RG line found in the header.

Perfect. I’ll just feed each final bridged BAM I have to samtools addreplacerg and maybe, just maybe, we’ll be able to pull the trigger on vr-pipe for the final time.

I hope.

Queue Queue Update: 1 day later

For those still following at home, my run of samtools addreplacerg seemed to go without a hitch, which in retrospect should have been suspicious. I manually inspected just a handful of the files to ensure both the “orphaned” reads now had an RG tag (and more specifically that it was the correct and only RG line in the file) and that the already tagged reads had not been interfered with. All seemed well.

After hitting the button on vr-pipe once more, it took a few hours for the web interface to catch up and throw up glaring red progress bars. It seemed the first step, the BAM indexing, was now failing? I had somehow managed to go a step backwards?

The bridged BAMs were truncated… Immediately I began scouring the source code of samtools addreplacerg before realising that in my excitement I had skipped my usual quality control test suite. I consulted bhist -e, a command to display recently submitted cluster jobs that had exited with error, and was bombarded with line after line of addreplacerg job metadata. Inquiring specifically, each and every job in the array had violated its run time limit.

I anticipated addreplacerg would not require much processing: it just iterates over the input BAM, slapping an RG sticker on any record missing one, and throws it on the output pile. Of course, with tens of millions of records per file, even the quickest of operations can aggregate into considerable time. Placing the addreplacerg jobs on Sanger’s short queue was thus clearly an oversight of mine.
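That iterate-and-sticker operation is simple enough to sketch. A Python approximation of the orphan-tagging mode (my own toy model operating on records as dicts of tags — not the actual samtools C implementation):

```python
def tag_orphans(header_rg_lines, records):
    """Approximation of samtools addreplacerg's orphan-tagging mode: any
    record lacking an RG tag receives the ID of the first @RG line in the
    header; already-tagged records pass through untouched."""
    first_rg = header_rg_lines[0]["ID"]
    for record in records:
        if "RG" not in record:
            record["RG"] = first_rg  # slap an RG sticker on the orphan
        yield record

# A header with a single @RG line, and one tagged plus one orphan read.
header = [{"ID": "lane_7500_7"}]
reads = [{"QNAME": "read1", "RG": "lane_7500_7"}, {"QNAME": "read2"}]

tagged = list(tag_orphans(header, reads))
assert all("RG" in r for r in tagged)
assert tagged[1]["RG"] == "lane_7500_7"
```

Trivial per record, but multiplied over tens of millions of records per BAM it still adds up to real wall-clock time — which is what blew the short queue’s limit.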

I resubmitted to the normal queue which permits a longer default run time limit and applied quality control to all files. We had the green light to re-launch vr-pipe.

Conquering Quotas Update: 4 days later

Jobs slowly crawled through the indexing and integrity checking stages and eventually began their way through the more time-consuming and intensive GATK indel discovery and re-alignment steps, before suddenly and uncontrollably failing in their hundreds. I watched a helpless vr-pipe attempt to resuscitate each job three times before calling a time of death on my project and shutting down the pipeline entirely.

Other than debugging output from vr-pipe scheduling and submitting the jobs, the error logs were empty. vr-pipe had no idea what was happening, and neither did I.

I escalated the situation to Martin, who could dig a little deeper with his mercury mask on. The problem appeared somewhat more widespread than just my pipeline; in fact all pipelines writing to the same storage cluster had come down with a case of sudden unexpected rapid job death. It was serious.

The situation: pipelines are orchestrated by vr-pipe, which is responsible for submitting jobs to the LSF scheduler for execution. LSF requires a user to be named as the owner of a job and so vr-pipe uses mercury. I am unsure whether this is just practical, or whether it is to ensure jobs get a fair share of resources by all having the same owner though I suspect it could just be an inflexibility in vr-pipe. The net result of jobs being run as mercury is that every output file is also owned by mercury. The relevance of this is that every storage cluster has user quotas, an upper bound on the amount of disk space files owned by that user may occupy before being denied write access.

Presumably you can see where this is going. In short, my job pushed mercury 28TB over-quota and so disk write operations failed. Jobs, unable to write to disk aborted immediately but the nature of the error was not propagated to vr-pipe.

Martin is kindly taking care of the necessary juggling to rebalance the books. Will keep you posted.


tl;dr

  • Nobody is exempt from version control, least of all a pipeline-orchestrating Makefile
  • brunel assumes its inputs are sorted and will happily produce unusable output if they aren’t
  • malloc does not zero your memory, and 0 is a valid (and dangerous) default array index
  • Check the run time limit of the queue you are submitting to before pressing the button
  • When hundreds of jobs die silently, check the disk quota


  1. Unfortunately somebody towing a caravan decided to have their car burst into flames on the southbound M11; this and several other incidents turned a boring five hour journey into a tiresome nine hour ordeal. 
  2. Somewhat like a glorified sudo for humgen projects. 
  3. Turns out, its default mode is overwrite_all
  4. Honestly, it doesn’t matter how trivial the file is, if it’s going to be messed around with frequently, or is a pinnacle piece of code for orchestrating a pipeline, put it under version control. Nobody is going to judge you for trivial use of version control. 