Duplicate definition error with GATK PrintReads and MalformedReadFilter
https://samnicholls.net/2016/01/07/gatk-printreads-malformedreadfilter/
Thu, 07 Jan 2016 19:27:17 +0000

This afternoon I wanted to quickly check1 whether some reads in a BAM would be filtered out by the GATK MalformedReadFilter. As you can’t invoke the filter alone, I figured one of the quickest ways to do this would be to utilise GATK PrintReads, which pretty much parses and spits out input BAMs, while also allowing one to specify filters and the like to be applied as the parser dutifully goes about its job of taking up all your cluster’s memory. I entered the command, taking care to specify MalformedRead for the -rf read filter option, feeling particularly pleased with myself for finally being capable of using a GATK command from memory:

java -jar GenomeAnalysisTK.jar -T PrintReads -rf MalformedRead -I <INPUT> -R <REFERENCE>

GATK, wanting to teach me a lesson for not consulting documentation, quickly dumped a stack trace to my terminal and wiped the smile off my face.

##### ERROR ------------------------------------------------------------------------------------------
##### ERROR stack trace 
org.broadinstitute.gatk.utils.exceptions.ReviewedGATKException: Duplicate definition of argument with full name: filter_reads_with_N_cigar
        at org.broadinstitute.gatk.utils.commandline.ArgumentDefinitions.add(ArgumentDefinitions.java:59)
        at org.broadinstitute.gatk.utils.commandline.ParsingEngine.addArgumentSource(ParsingEngine.java:150)
        at org.broadinstitute.gatk.utils.commandline.CommandLineProgram.start(CommandLineProgram.java:207)
        at org.broadinstitute.gatk.utils.commandline.CommandLineProgram.start(CommandLineProgram.java:155)
        at org.broadinstitute.gatk.engine.CommandLineGATK.main(CommandLineGATK.java:106)
##### ERROR ------------------------------------------------------------------------------------------
##### ERROR A GATK RUNTIME ERROR has occurred (version 3.5-0-g36282e4):
##### ERROR
##### ERROR This might be a bug. Please check the documentation guide to see if this is a known problem.
##### ERROR If not, please post the error message, with stack trace, to the GATK forum.
##### ERROR Visit our website and forum for extensive documentation and answers to 
##### ERROR commonly asked questions http://www.broadinstitute.org/gatk
##### ERROR
##### ERROR MESSAGE: Duplicate definition of argument with full name: filter_reads_with_N_cigar
##### ERROR ------------------------------------------------------------------------------------------

At this point I felt somewhat hopeless: I was actually trying to use the MalformedReadFilter to debug something else, and now I was stuck two errors deep, surrounded by more Java than I could stomach. Before having a full breakdown about whether bioinformatics really is broken, I remembered I am a little familiar with the filter in question. Indeed, I recognised the filter_reads_with_N_cigar argument from the error as one that can be supplied to the MalformedReadFilter itself. This seemed a little odd: where could it be getting a duplicate definition from?

Of course, from my own blog post and the PrintReads manual page, I should have recalled that the MalformedReadFilter is automatically applied by PrintReads. Specifying the same filter on top with -rf apparently causes something of a parsing upset. So there you have it: if you want to check whether your reads will be discarded by the MalformedReadFilter, you can just use PrintReads:

java -jar GenomeAnalysisTK.jar -T PrintReads -I <INPUT> -R <REFERENCE>

tl;dr

  • GATK PrintReads applies the MalformedReadFilter automatically
  • Specifying -rf MalformedRead to PrintReads is not only redundant but problematic
  • Always read the fucking manual
  • Read your own damn blog
  • GATK is unforgiving

  1. It’s about time I realised that in bioinformatics, nobody has ever successfully “quickly checked” anything. 
Grokking GATK: Common Pitfalls with the Genome Analysis Tool Kit (and Picard)
https://samnicholls.net/2015/11/11/grokking-gatk/
Wed, 11 Nov 2015 16:11:50 +0000

The Genome Analysis Tool Kit (“the” GATK) is a big part of our pipeline here. Recently I’ve been following the DNASeq Best Practice Pipeline for my limpet sequence data. Here are some of the mistakes I made and how I made them go away.

Input file extension pedanticism

Invalid command line: The GATK reads argument (-I, --input_file) supports only BAM/CRAM files with the .bam/.cram extension

Starting small: this was a simple oversight on my part. My naming script had made a mistake, but I knew the files were BAM, so I ignored the issue and continued with the pipeline anyway. GATK, however, was not impressed and aborted immediately. A minor annoyance (the error even acknowledges the input appears to be BAM) but a trivial fix.

A sequence dictionary (and index) is compulsory for use of a FASTA reference

Fasta dict file <ref>.dict for reference <ref>.fa does not exist. Please see http://gatkforums.broadinstitute.org/discussion/1601/how-can-i-prepare-a-fasta-file-to-use-as-reference for help creating it.

Unmentioned in the documentation for the RealignerTargetCreator tool I was using, a sequence dictionary for the reference FASTA must be built and present in the same directory. The error kindly refers you to a help article on how one can achieve this with Picard and indeed, the process is simple:

java -jar ~/git/picard-tools-1.138/picard.jar CreateSequenceDictionary R=<ref>.fa O=<ref>.dict

Though, I am somewhat confused as to what exactly a .dict file provides GATK over a FASTA index .fai (which is also required). Both files include the name and length of each contig in the reference, but the .fai also includes byte offsets vital to enabling fast random access. The only additional information in the SAM-header-like sequence dictionary appears to be an MD5 hash of the sequence, which doesn’t seem overly useful in this scenario. I guess the .dict adds a layer of protection if GATK uses the hash as a sanity check, ensuring the loaded reference matches the one for which the index and dictionary were constructed.
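To make the comparison concrete, here is a minimal sketch parsing toy examples of the two formats side by side. The contig name, length and MD5 are invented for illustration; real files come from samtools faidx (.fai) and Picard CreateSequenceDictionary (.dict):

```python
# Hypothetical example contents, not taken from a real reference.
FAI = "contig1\t4000000\t9\t70\t71\n"  # name, length, byte offset, bases/line, bytes/line
DICT = ("@HD\tVN:1.5\n"
        "@SQ\tSN:contig1\tLN:4000000\tM5:0123456789abcdef0123456789abcdef\n")

def parse_fai(text):
    """The .fai is a TSV; the byte offset is what enables fast random access."""
    out = {}
    for line in text.splitlines():
        name, length, offset, linebases, linewidth = line.split("\t")
        out[name] = {"length": int(length), "offset": int(offset)}
    return out

def parse_dict(text):
    """The .dict is a SAM header: @SQ lines carry SN (name), LN (length), M5 (MD5)."""
    out = {}
    for line in text.splitlines():
        if not line.startswith("@SQ"):
            continue
        fields = dict(f.split(":", 1) for f in line.split("\t")[1:])
        out[fields["SN"]] = {"length": int(fields["LN"]), "md5": fields.get("M5")}
    return out

fai, sqd = parse_fai(FAI), parse_dict(DICT)
# Both agree on name and length; only the .fai knows the byte offset,
# and only the .dict carries the sequence MD5.
assert fai["contig1"]["length"] == sqd["contig1"]["length"]
```

As the sketch shows, the two files are almost entirely redundant with each other, the MD5 being the .dict’s only unique contribution.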

You forgot to index your intermediate BAM

Invalid command line: Cannot process the provided BAM file(s) because they were not indexed. The GATK does offer limited processing of unindexed BAMs in --unsafe mode, but this GATK feature is currently unsupported.

Another frequently occurring issue caused by user forgetfulness. Following the best practice pipeline, one generates many “intermediate” BAMs, each of these must be indexed for efficient use during the following step, otherwise GATK will be disappointed with your lack of attention to detail and refuse to do any work for you.

Edit (13 Nov):  A helpful reddit comment from a Picard contributor recommended setting CREATE_INDEX=true when using Picard, to automatically create an index of your newly output BAM. Handy!

Your temporary directory is probably too small

Unable to create a temporary BAM schedule file. Please make sure Java can write to the default temp directory or use -Djava.io.tmpdir= to instruct it to use a different temp directory instead.

GATK appears to love creating hundreds of thousands of small bamschedule.* files which, according to a glance at some relevant-looking GATK source, handle multithreaded merging of large BAM files. So numerous are these files that their presence totalled my limited temporary space. This was especially frustrating given the job had run for several hours, blissfully unaware that there are only so many things you can store in a shoebox. To avoid such disaster, inform Java of a more suitable location to store junk:

java -Djava.io.tmpdir=/not/a/shoebox/ -jar <jar> <tool> ...

On rare occasions, you may encounter permission errors when writing to a temporary directory. Specifying java.io.tmpdir as above will free you of these woes too.

You may have too many files and not enough file handles

Picard and GATK try to store some number of reads (or other plentiful metadata) in RAM during the parsing and handling of BAM files. When this limit is exceeded, reads are spilled to disk. Both Picard and GATK appear to keep file handles for these spilled reads open simultaneously, presumably for fast access. But your executing user is likely limited to carrying only so many handles before becoming over-encumbered and falling to the ground, with throwing an exception the only option:

Exception in thread “main” htsjdk.samtools.SAMException: […].tmp not found
[…]
Caused by: java.io.FileNotFoundException: […].tmp (Too many open files)

In my case, I encountered this error when using Picard MarkDuplicates, which has a default maximum number of file handles1. This ceiling happened to be higher than that of the system itself. The fix in this case is trivial: use ulimit -n to determine the number of files your system will permit you to have a handle on at once, and inform MarkDuplicates using the MAX_FILE_HANDLES_FOR_READ_ENDS_MAP parameter:

$ ulimit -n
1024

$ java -jar picard.jar MarkDuplicates MAX_FILE_HANDLES_FOR_READ_ENDS_MAP=1000 ...

This is somewhat counter-intuitive as the error is caused by an acute overabundance of file handles, yet my suggested fix is to permit even fewer handles? In this case at least, it appears Picard compensates by creating fewer, larger spill files. You’ll notice I didn’t use the exact value of ulimit -n in the argument; it’s likely there’ll be a few other file handles open here and there (your input, output and metrics file, at least) and so you’ll stumble across the same error once more.
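If you want to derive that argument value programmatically rather than eyeballing ulimit -n, a small sketch using Python’s resource module works; note the headroom of 24 is my own arbitrary safety margin, not a Picard recommendation:

```python
import resource

def safe_max_file_handles(headroom=24):
    """Return a MAX_FILE_HANDLES_FOR_READ_ENDS_MAP value comfortably below the
    per-process open-file soft limit, leaving headroom for the input, output
    and metrics files (and anything else the JVM already has open)."""
    soft, _hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    if soft == resource.RLIM_INFINITY:
        # No effective limit; fall back to Picard's own default ceiling.
        return 8000
    return max(1, soft - headroom)

print(safe_max_file_handles())  # e.g. 1000 on a system where `ulimit -n` reports 1024
```

The same number could then be passed on the MarkDuplicates command line as shown above.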

From a little searching, it appears that for the most part GATK will open as many files as it wants, and if that number is greater than ulimit -n, it will throw a tantrum. Unfortunately, you’re out of luck here for solving the problem on your own. Non-administrative users cannot increase the number of file handles they are permitted to have open, so you’ll need to befriend your system administrator and kindly request that the hard limit for file handles be raised before continuing. Though, the same discussion does suggest that lowering the number of GATK execution threads can potentially alleviate the issue in some cases.

Your maximum Java heap is also too small

There was a failure because you did not provide enough memory to run this program.  See the -Xmx JVM argument to adjust the maximum heap size provided to Java

GATK has an eating problem: it has no self-restraint when memory is on the table. I’m not sure whether GATK was brought up with many siblings that had to fight for food, but it certainly doesn’t help that it is implemented in Java, a language not particularly known for its memory efficiency. When invoked, Java will allocate a heap on which to pile the many objects it wants to keep around, with a typical default maximum size of around 1GB. It’s not enough to just specify to your job scheduler that you need all of the RAM; you also need to let Java know that it is welcome to expand the heap for dumping genomes beyond the default maximum. Luckily this is quite simple:

java -Xmx<int>g -jar <jar> <tool> ...

The MalformedReadFilter has a looser definition of malformed than expected

I’ve previously touched on the discovery that the GATK MalformedReadFilter is much more aggressive than its documentation lets on. The lovely GATK developers have even opened an issue about it after I reported it in their forum.


tl;dr

  • Your BAM files should end in .bam
  • Any FASTA based reference needs both an index (.fai) and dictionary (.dict)
  • Be indexing, always
  • pysam is a pretty nice package for dealing with SAM/BAM files in Python
  • Your temp dir is too small, specify -Djava.io.tmpdir=/path/to/big/disk/ to java when invoking GATK
  • Picard may generously overestimate the number of file handles available
  • GATK is a spoilt child and will have as many file handles as it wants
  • Apply more memory to GATK with java -Xmx<int>g to avoid running out of heap
  • Remember, the MalformedReadFilter is rather aggressive
  • You need a bigger computer

  1. At the time of writing, 8000. 
The Tolls of Bridge Building: Part IV, Mysterious Malformations
https://samnicholls.net/2015/08/27/bridgebuilding-p4/
Thu, 27 Aug 2015 11:00:38 +0000

Following a short hiatus on the sample un-improvement job (which may or may not have been halted by vr-pipe inadvertently knocking over a storage node at the Sanger Institute), our 837 non-33 jobs burst back into life, only to fall at the final hurdle of the first pipeline of the vr-pipe workflow. Despite my lack of deerstalker and pipe, it was time to play bioinformatics Sherlock Holmes.

Without mercury access to review reams of logs myself, I had to rely on the vr-pipe web interface, which provides a single sample of output from a random job in the state of interest. Taking its offering, it seemed I was about to get extremely familiar with lanelet 7293_3#8. The error was at least straightforward…

Accounting Irregularities1

... [step] failed because 48619252 reads were generated in the output bam
file, yet there were 48904972 reads in the original bam file at [...]
/VRPipe/Steps/bam_realignment_around_known_indels.pm line 167

The first thing to note is that the error is raised in vr-pipe itself, not in the software used to perform the step in question — which happened to be GATK, for those interested. vr-pipe is open source software, hosted on Github. The deployed release of vr-pipe used by the team is a fork and so the source raising the error is available to anyone with the stomach to read Perl:

my $expected_reads = $in_file->metadata->{reads} || $in_file->num_records;
my $actual_reads = $out_file->num_records;

if ($actual_reads == $expected_reads) {
    return 1;
}
else {
    $out_file->unlink;
    $self->throw("cmd [$cmd_line] failed because $actual_reads reads were generated in the output bam file, yet there were $expected_reads reads in the original bam file");
}

The code is simple: vr-pipe just enforces a check that all the reads from the input file make it through to the output file. Seems sensible. This is very much desired behaviour, as we never drop reads from BAM files; there are enough flags and scores in the SAM spec to put reads aside for one reason or another without actually removing them. So we know that the job proper completed successfully (there was an output file, sans errors from GATK itself). The question now is: where did those reads go?

Though before launching a full-scale investigation, my first instinct was to question the error itself and check the numbers stated in the message were even correct. It was easy to confirm the original number of reads, 48,904,972, by checking the vr-pipe metadata I generated to submit the bridged BAMs to vr-pipe; I also ran samtools view -c on the bridged BAM again to be sure.

But vr-pipe‘s standard over-reaction to job failure is nuking all the output files, so I’d need to run the step myself manually on lanelet 7293_3#8. A few hours later, samtools view -c and samtools stats confirmed the output file really did contain 48,619,252 reads, a shortfall of 285,720.

I asked Irina, one of our vr-pipe sorceresses with mercury access, to dig out the whole log for our spotlight lanelet. Two stub INFO lines sitting right at the tail of the GATK output shed immediate light on the situation…

Expunged Entries

INFO 18:31:28,970 MicroScheduler – 285720 reads were filtered out during the traversal out of approximately 48904972 total reads (0.58%)

INFO 18:31:28,971 MicroScheduler – -> 285720 reads (0.58% of total) failing MalformedReadFilter

Every single one of the 285,720 “missing” reads can be accounted for by this MalformedReadFilter. This definitely isn’t expected behaviour; as I said already, we have ways of marking reads as QC-failed, supplementary or unmapped without just discarding them wholesale. Our pipeline is not supposed to drop data. The MalformedReadFilter documentation on the GATK website states:

Note that the MalformedRead filter itself does not need to be specified in the command line because it is set automatically.

This at least explains the “unexpected” nature of the filter, but surely vr-pipe encounters files with bad reads that need filtering all the time? My project can’t be the first… I figured I must have missed a pre-processing step, so I asked around: “Was I supposed to do my own filtering?”. I compared the @PG lines documenting programs applied to the 33 to see whether they had undergone different treatment, but I couldn’t see anything related to filtering, quality or otherwise.

I escalated the problem to Martin who replied with a spy codephrase:

the MalformedReadFilter is a red herring

Both Martin and Irina had seen similar errors that were usually indicative of the wrong version of vr-pipe [and|or] samtools — causing a subset of flagged reads to be counted rather than all. But I explained that the versions were correct and that I’d manually confirmed the reads as actually missing by running the command myself; we were all a little stumped.

I read the manual for the filter once more and realised we’d ignored the gravity of the word “malformed”:

This filter is applied automatically by all GATK tools in order to protect them from crashing on reads that are grossly malformed.

We’re not talking about bad quality reads; we’re talking about reads that are incorrect in such a way that they may cause an analysis to terminate early. My heart sank, I had a hunch. I ran the output from lanelet 7293_3#8‘s bridgebuilder adventure through GATK PrintReads, an arbitrary tool that I knew also applied the MalformedReadFilter. Those very same INFO lines were printed.

My hunch was right, the reads had been malformed all along.

Botched Bugfix

I had a pretty good idea as to what had happened, but I ran the inputs to brunel (the final step of bridgebuilder) through PrintReads as a sanity check. This proved a little more difficult to orchestrate than one might have expected; I had to falsify headers and add those pesky missing RG tags that plagued us before.

The inputs were fine, as suspected, brunel was the malformer. My hunch? My quick hack to fix my last brunel problem had come back to bite me in the ass and caused an even more subtle brunel problem.

Indeed, despite stringently checking the bridged BAMs with five different tools, successfully generating an index and even processing the file with Picard to mark duplicate reads and GATK to detect and re-align around indels, these malformed reads still flew under the radar — only to be caught by a few lines of Perl that a minor patch to vr-pipe happened to put in the way.

Recall that my brunel fix initialises the translation array with -1:

Initialise trans[i] = -1
The only quick-fix grade solution that works: any read on a TID that has no translation is regarded as “unmapped”. Its TID will be set to “*” and the read is placed at the end of the result file. The output file is, however, valid and indexable.

This avoided an awful bug where brunel would assign reads to chromosome 1 if their actual chromosome did not have a translation listed in the user-input translation file. In practice, the fix worked. Reads appeared “unmapped”, their TID was an asterisk and they were listed at the end of the file. The output was viewable, indexable and usable with Picard and GATK, but technically not valid after all.

To explain why, let’s feed everybody’s favourite bridged BAM, 7293_3#8, to Picard ValidateSamFile, a handy tool that does what it says on the tin. ValidateSamFile linearly passes over each record of a BAM and prints any associated validation warnings or errors to a log file2. As the tool moved over the target, an occasional warning that could safely be ignored was written to my terminal3. The file seemed valid and, as the progress bar indicated we were nearly out of chromosomes, I braced.

As ValidateSamFile attempted to validate the unmapped reads, a firehose of errors (I was expecting a lawn sprinkler) spewed up the terminal, and despite my best efforts I couldn’t Ctrl-C in time to point it away from my face.

False Friends

There was a clear pattern to the log file. Each read that I had translated to unmapped triggered four errors. Taking just one of thousands of such quads:

ERROR: […] Mate Alignment start should be 0 because reference name = *.
ERROR: […] Mapped mate should have mate reference name
ERROR: […] Mapped read should have valid reference name
ERROR: […] Alignment start should be 0 because reference name = *.

Now, a bitfield of flags on each read is just one of the methods employed by the BAM format to perform validation and filtering operations quickly and efficiently. One of these bits, 0x4, indicates a read is unmapped. Herein lies the problem: although I had instructed brunel to translate the TID (which refers to the i-th @SQ line in the header) to -1 (i.e. no sequence), I did not set the unmapped flag. This is invalid (Mapped read should have valid reference name), as the read will appear as aligned, but to an unknown reference sequence.
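The arithmetic of the eventual fix can be sketched in a few lines; this mimics what the corrected brunel needed to do, it is not brunel’s actual code, and the example flag value is invented for illustration:

```python
FLAG_UNMAPPED = 0x4      # SAM spec: segment unmapped
FLAG_PROPER_PAIR = 0x2   # SAM spec: each segment properly aligned

def make_unmapped(flag):
    """Translate a read to 'no reference': the TID becomes -1, the unmapped
    bit must be raised, and a read that isn't mapped can no longer claim
    to be in a proper pair."""
    tid = -1                   # RNAME will render as "*"
    flag |= FLAG_UNMAPPED      # raise 0x4
    flag &= ~FLAG_PROPER_PAIR  # clear 0x2
    return tid, flag

# My original hack: tid became -1 but the flag was untouched, so the read
# still *claimed* to be mapped -- exactly what ValidateSamFile complained about.
broken_flag = 0x63  # paired + proper pair + mate reverse + first in pair
assert not (broken_flag & FLAG_UNMAPPED)  # looks mapped, references nothing: invalid

_tid, fixed_flag = make_unmapped(broken_flag)
assert fixed_flag & FLAG_UNMAPPED and not (fixed_flag & FLAG_PROPER_PAIR)
```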

Sigh. A quick hack combined with my ignorance of the underlying BAM format specification was at fault4.

The fix would be frustratingly easy, a one-liner in brunel to raise the UNMAPPED flag (and to be good, a one-liner to unset the PROPER_PAIR flag5) for the appropriate reads. Of course, expecting an easy fix jinxed the situation and the MalformedReadFilter cull persisted, despite my new semaphore knowledge.

For each offending read, ValidateSamFile produced a triplet of errors:

ERROR: […] Mate Alignment start should be 0 because reference name = *.
ERROR: […] MAPQ should be 0 for unmapped read.
ERROR: […] Alignment start should be 0 because reference name = *.

The errors at least seem to indicate that I’d set the UNMAPPED flag correctly. Confusingly, the format spec has the following point (parentheses and emphasis mine):

Bit 0x4 (unmapped) is the only reliable place to tell whether the read is unmapped. If 0x4 is set, no assumptions can be made about RNAME, POS, CIGAR, MAPQ, and bits 0x2 (properly aligned), 0x100 (secondary), and 0x800 (supplemental).

It would seem that canonically, the alignment start (POS) and mapping quality (MAPQ) are untrustworthy on reads where the unmapped flag is set. Yet this triplet appears exactly the same number of times as the number of reads hard filtered by the MalformedReadFilter.

I don’t understand how these reads could be regarded as “grossly malformed” if even the format specification acknowledges the possibility of these fields containing misleading information. Invalid, yes, but grossly malformed? No. I just have to assume there’s a reason GATK is being especially anal about such reads; perhaps the developers simply (and rather fairly) don’t want to deal with the scenario of not knowing what to do with reads where the values of half the fields can’t be trusted6. I updated brunel to set the alignment positions for the read and its mate to position 0 in retaliation.

I’d reduced the ValidateSamFile error quads to triplets, and now the triplets to duos:

ERROR: […] Mate Alignment start should be 0 because reference name = *.
ERROR: […] Alignment start should be 0 because reference name = *.

The alignment positions are non-zero? But I just said I’d fixed that? What gives?

I used samtools view and grabbed the tail of the BAM, those positions were indeed non-zero. I giggled at the off-by-one error, immediately knowing what I’d done wrong.

The BAM spec describes the alignment position field as:

POS: 1-based leftmost mapping POSition of the first matching base.

But under the hood, the pos field is defined in htslib as a 0-based co-ordinate, because that’s how computers work. The 0-based indices are then converted by just adding 1 whenever necessary. Thus to obtain a value of 0, I’d need to set the positions to -1. This off-by-one mismatch is a constant source of embarrassing mistakes in bioinformatics, where typically 1-based genomic indices must be reconciled with 0-indexed data structures7.
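The whole off-by-one can be captured in a toy conversion function (my own sketch of the htslib convention, not its actual code):

```python
UNMAPPED_POS = -1  # htslib's in-memory sentinel for "no position"

def pos_to_sam(pos0):
    """Convert an htslib-style 0-based position to the 1-based POS column of
    SAM text. The sentinel -1 therefore prints as 0, which is what
    ValidateSamFile wants to see for an unmapped read."""
    return pos0 + 1

assert pos_to_sam(UNMAPPED_POS) == 0  # unmapped: displayed POS is 0
assert pos_to_sam(0) == 1             # first base of a contig prints as 1
# My mistake: storing 0 in the record meant the displayed POS was 1, not 0.
assert pos_to_sam(0) != 0
```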

Updating brunel once more, I ran the orchestrating Makefile to generate what I really hope to be the final set of bridged BAMs, ran them through Martin’s addreplacerg subcommand to fill in those missing RG tags, and then each went up against the now six final check tools (yes, they validated with ValidateSamFile). I checked index generation, re-ran a handful of sanity checks (mainly ensuring we hadn’t lost any reads) and re-generated the manifest file, before finally asking Irina, for what I really hope to be the last time, to reset my vr-pipe setup.

I should add, Martin suggested that I use samtools fixmate, a subcommand designed for this very purpose. The problem is, for fixmate to know where the mates are, they must be adjacent to each other; that is, sorted by name and not by co-ordinate. It was thus cheaper, both in time and computation, to just re-run brunel than to name-sort, fixmate and coord-sort the current bridged BAMs.
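To illustrate the ordering fixmate depends on, here is a toy check (my own sketch, not samtools code) that read names in a file are grouped adjacently, as name-sorting guarantees:

```python
def mates_adjacent(qnames):
    """Check reads are grouped by QNAME (what samtools fixmate needs):
    once we've moved past a name, it must never reappear later."""
    seen, last = set(), None
    for q in qnames:
        if q != last:
            if q in seen:
                return False  # a mate turned up far from its partner
            seen.add(q)
            last = q
    return True

assert mates_adjacent(["r1", "r1", "r2", "r2"])     # name-grouped: fixmate is happy
assert not mates_adjacent(["r1", "r2", "r1", "r2"])  # coordinate-ish order: mates apart
```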

Victorious vr-pipe Update: Hours later

Refreshing the vr-pipe interface intermittently, I was waiting to see a number of inputs larger than 33 make it to the end of the first pipeline of the workflow: represented by a friendly looking green progress bar.

Although the number currently stands at 1, I remembered that all of the 33 had the same leading digit and that for each progress bar, vr-pipe will offer up metadata on one sample job. I hopefully clicked the green progress bar and inspected the metadata, the input lanelet was not in the set of the 33 already re-mapped lanelets.

I’d done it. After almost a year, I’ve led my lanelet Lemmings to the end of the first level.


tl;dr

  • GATK has an anal, automated and aggressive MalformedReadFilter
  • Picard ValidateSamFile is a useful sanity check for SAM and BAM files
  • Your cleverly simple bug fix has probably swapped a critical and obvious problem for a critically unnoticeable one
  • Software is not obliged to adhere to any or all of a format specification
  • Off by one errors continue to produce laughably simple mistakes
  • I should probably learn the BAM format spec inside out
  • Perl is still pretty grim

  1. Those reads were just resting in my account! ↩
  2. Had I known of the tool sooner, I would have employed it as part of the extensive bridgebuilder quality control suite. ↩
  3. Interestingly, despite causing jobs to terminate, a read missing an RG tag is a warning, not an error. ↩
  4. Which only goes to further my don’t trust anyone, even yourself mantra. ↩
  5. Although not strictly necessary as per the specification (see below), I figured it was cleaner. ↩
  6. Though, looking at the documentation, I’m not even sure what aspect of the read is triggering the hard filter. The three additional command line options described don’t seem to be related to any of the errors raised by ValidateSamFile and there is no explicit description of what is considered to be “malformed”. ↩
  7. When I originally authored Goldilocks, I tried to be clever and make things easier for myself, electing to use a 1-based index strategy throughout. This was partially inspired by FORTRAN, which features 1-indexed arrays. In the end, the strategy caused more problems than it solved and I had to carefully tuck my tail between my legs and return to a 0-based index. ↩