Grokking GATK: Common Pitfalls with the Genome Analysis Tool Kit (and Picard)

The Genome Analysis Tool Kit (“the” GATK) is a big part of our pipeline here. Recently I’ve been following the DNASeq Best Practice Pipeline for my limpet sequence data. Here are some of the mistakes I made, and how I made them go away.

Input file extension pedanticism

Invalid command line: The GATK reads argument (-I, –input_file) supports only BAM/CRAM files with the .bam/.cram extension

Starting small, this was a simple oversight on my part: my naming script had made a mistake, but I knew the files were BAM, so I ignored the issue and continued with the pipeline anyway. GATK, however, was not impressed and aborted immediately. A minor annoyance (the error even acknowledges the input appears to be BAM), but a trivial fix.
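
If the files really are BAM, the fix is exactly what the error implies: rename (or symlink) them so the extension agrees. The filename below is hypothetical:

mv limpet.aln limpet.aln.bam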

A sequence dictionary (and index) is compulsory for use of a FASTA reference

Fasta dict file <ref>.dict for reference <ref>.fa does not exist. Please see http://gatkforums.broadinstitute.org/discussion/1601/how-can-i-prepare-a-fasta-file-to-use-as-reference for help creating it.

Unmentioned in the documentation for the RealignerTargetCreator tool I was using: a sequence dictionary for the reference FASTA must be built and present in the same directory. The error kindly refers you to a help article on how to achieve this with Picard, and indeed the process is simple:

java -jar ~/git/picard-tools-1.138/picard.jar CreateSequenceDictionary R=<ref>.fa O=<ref>.dict

Though, I am somewhat confused as to what exactly a .dict file provides GATK over a FASTA index .fai (which is also required). Both files include the name and length of each contig in the reference, but the .fai also includes the positional information vital to enabling fast random access. The only additional information in the SAM-header-like sequence dictionary appears to be an MD5 hash of each sequence, which doesn’t seem overly useful in this scenario. I guess the .dict adds a layer of protection if GATK uses the hash as a sanity check, ensuring the loaded reference matches the one for which the index and dictionary were constructed.
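
For completeness, the companion .fai index is generated with samtools faidx, which writes <ref>.fa.fai alongside the reference:

samtools faidx <ref>.fa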

You forgot to index your intermediate BAM

Invalid command line: Cannot process the provided BAM file(s) because they were not indexed. The GATK does offer limited processing of unindexed BAMs in –unsafe mode, but this GATK feature is currently unsupported.

Another frequently occurring issue caused by user forgetfulness. Following the best practice pipeline, one generates many “intermediate” BAMs, and each of these must be indexed for efficient use during the following step; otherwise GATK will be disappointed with your lack of attention to detail and refuse to do any work for you.

Edit (13 Nov):  A helpful reddit comment from a Picard contributor recommended setting CREATE_INDEX=true when using Picard, to automatically create an index of your newly output BAM. Handy!
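
Both routes look something like the below; samtools index is the general-purpose option, while the MarkDuplicates invocation sketches the Picard route (filenames are placeholders):

samtools index <out>.bam

java -jar picard.jar MarkDuplicates I=<in>.bam O=<out>.bam M=<metrics>.txt CREATE_INDEX=true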

Your temporary directory is probably too small

Unable to create a temporary BAM schedule file. Please make sure Java can write to the default temp directory or use -Djava.io.tmpdir= to instruct it to use a different temp directory instead.

GATK appears to love creating hundreds of thousands of small bamschedule.* files which, from a glance at some relevant-looking GATK source, appear to handle multithreaded merging of large BAM files. So numerous are these files that their presence totalled my limited temporary space. This was especially frustrating given the job had run for several hours, blissfully unaware that there are only so many things you can store in a shoebox. To avoid such disaster, inform Java of a more suitable location to store junk:

java -Djava.io.tmpdir=/not/a/shoebox/ -jar <jar> <tool> ...

On rare occasions, you may encounter permission errors when writing to a temporary directory. Specifying java.io.tmpdir as above will free you of these woes too.
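
For example, one might carve out a directory on a disk known to be roomy and writable before invoking GATK (the /scratch path here is just an illustration):

mkdir -p /scratch/$USER/javatmp
java -Djava.io.tmpdir=/scratch/$USER/javatmp -jar GenomeAnalysisTK.jar -T RealignerTargetCreator ...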

You may have too many files and not enough file handles

Picard and GATK try to store some number of reads (or other plentiful metadata) in RAM during the parsing and handling of BAM files. When this limit is exceeded, reads are spilled to disk. Both Picard and GATK appear to keep the file handles for these spilled reads open simultaneously, presumably for fast access. But your executing user is likely limited to carrying only so many handles before becoming over-encumbered and falling to the ground, throwing an exception being the only option:

Exception in thread “main” htsjdk.samtools.SAMException: […].tmp not found
[…]
Caused by: java.io.FileNotFoundException: […].tmp (Too many open files)

In my case, I encountered this error when using Picard MarkDuplicates, which has a default maximum number of file handles1. This ceiling happened to be higher than that of the system itself. The fix in this case is trivial: use ulimit -n to determine the number of files your system will permit you to have a handle on at once, and inform MarkDuplicates using the MAX_FILE_HANDLES_FOR_READ_ENDS_MAP parameter:

$ ulimit -n
1024

$ java -jar picard.jar MarkDuplicates MAX_FILE_HANDLES_FOR_READ_ENDS_MAP=1000 ...

This is somewhat counter-intuitive: the error is caused by an acute overabundance of file handles, yet my suggested fix is to permit even fewer handles? In this case at least, it appears Picard compensates by creating fewer, larger spill files. You’ll notice I didn’t use the exact value of ulimit -n in the argument; it’s likely there’ll be a few other file handles open here and there (your input, output and metrics files, at least) and you’d stumble across the same error once more.

From a little searching, it appears that for the most part GATK will open as many files as it wants, and if that number is greater than ulimit -n, it will throw a tantrum. Unfortunately, you’re mostly out of luck here for solving the problem on your own: non-administrative users cannot raise the hard limit on open file handles, so you’ll need to befriend your system administrator and kindly request that it be raised before continuing. Though, the same discussion does suggest that lowering the number of GATK execution threads can alleviate the issue in some cases.
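
One nuance worth knowing: ulimit actually tracks a soft limit, which you may raise yourself, and a hard ceiling, which only an administrator can lift. Check both before filing that ticket (the numbers below are illustrative):

$ ulimit -Sn
1024
$ ulimit -Hn
4096
$ ulimit -n 4096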

Your maximum Java heap is also too small

There was a failure because you did not provide enough memory to run this program.  See the -Xmx JVM argument to adjust the maximum heap size provided to Java

GATK has an eating problem; it shows no self-restraint when memory is on the table. I’m not sure whether GATK was brought up with many siblings that had to fight for food, but it certainly doesn’t help that it is implemented in Java, a language not particularly known for its memory efficiency. When invoked, Java will allocate a heap on which to pile the many objects it wants to keep around, with a typical default maximum size of around 1GB. It’s not enough to specify to your job scheduler that you need all of the RAM; you also need to let Java know that it is welcome to expand the heap for dumping genomes beyond the default maximum. Luckily this is quite simple:

java -Xmx<int>G -jar <jar> <tool> ...
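
For example, granting a 32GB heap (an arbitrary figure; size it to your data and your scheduler allocation, remembering the JVM needs a little headroom beyond the heap itself):

java -Xmx32G -jar GenomeAnalysisTK.jar -T HaplotypeCaller ...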

The MalformedReadFilter has a looser definition of malformed than expected

I’ve previously touched on the discovery that the GATK MalformedReadFilter is much more aggressive than its documentation lets on. The lovely GATK developers have even opened an issue about it after I reported it in their forum.


tl;dr

  • Your BAM files should end in .bam
  • Any FASTA based reference needs both an index (.fai) and dictionary (.dict)
  • Be indexing, always
  • pysam is a pretty nice package for dealing with SAM/BAM files in Python
  • Your temp dir is too small, specify -Djava.io.tmpdir=/path/to/big/disk/ to java when invoking GATK
  • Picard may generously overestimate the number of file handles available
  • GATK is a spoilt child and will have as many file handles as it wants
  • Apply more memory to GATK with java -Xmx<int>G to avoid running out of heap
  • Remember, the MalformedReadFilter is rather aggressive
  • You need a bigger computer

  1. At the time of writing, 8000. 
Status Report: October 2015

As is customary with any blog that I attempt to keep, I’ve somewhat fallen behind in providing timely updates and am instead hoarding drafts in various states of readiness. This was not helped by my arguably ill-thought-out move to install WordPress, and the rather painful migration that followed. Now that the dust has mostly settled, I figured it might be nice to outline what I am actually working on before inevitably publishing a new epic tale of computational disaster.

The bulk of my work falls under two main projects that should hopefully sound familiar to those who follow the blog:

Metagenomes

I’ve now entered the second year of my PhD at Aberystwyth University, following my recent fries-and-waffle-fueled research adventure in Belgium. As a brief introduction for the uninitiated, I work in metagenomics: the study of all genetic sequences found in an environment. In particular, I’m interested in the metagenomes of microbial populations that have adapted to produce “interesting” enzymes (catalysts for chemical reactions). A few weeks ago, I presented a poster on the “metahaplome”1, which is the culmination of my first year of work: to define and formalize how variation in sequences that produce these enzymes can be collected and organized.

DNA Quality Control

Over the summer, I returned to the Wellcome Trust Sanger Institute to continue some work I started as part of my undergraduate thesis. I’ve introduced the task previously and so will spare you the long-winded description, but the project initially stalled due to the significant time and effort required to prepare part of the data set. During my brief re-visit, I picked up where I left off with the aim of completing the data set. You may have read that I encountered several problems along the way, and that even when this mammoth task finally appeared complete, it was not. Shortly after I arrived in Leuven, the final execution of the sample improvement pipeline completed, and we’re ready to move forward with the analysis.

Side Projects

As is inevitable when you give a PhD to somebody with a short attention span, I have begun to accumulate some side projects:

SAMTools

The Sequence Alignment and Mapping Tools2 suite is a hugely popular open source bioinformatics toolkit for interacting with sequencing data. During my undergraduate thesis I contributed a naive header parser to a project fork, which improved the speed of merging large numbers of sequence files by several orders of magnitude. Recently, amongst a few small fixes here and there, I’ve added functionality to produce samtools stats output split by tags (such as @RG lines) and submitted a proposal to deprecate legacy samtools sort usage. With some time over the upcoming holidays, I hope to finally contribute a proper header parser in time for samtools 1.4.
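
For the curious, the split functionality hangs off a pair of new options to samtools stats; usage is roughly as below, though check samtools stats --help for the exact spelling in your version:

samtools stats --split RG <in>.bam > <in>.stats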

goldilocks

You may remember that I authored a Python package called goldilocks (YouTube: Goldilocks: Locating genomic regions that are “just right”, 1st RSG UK Symposium, Oct 2014) as part of my undergraduate work, to find a “just right” 1Mbp region of the human genome that was “representative” in terms of the variation expressed. Following some tidying and much optimisation, it’s now a proper, documented package, and I’m waiting to hear feedback on the submission of my first paper.

sunblock

You may have noticed my opinion of Sun Grid Engine, and the trouble I have had in using it at scale. To combat this, I’ve been working on a small side project called sunblock: a Python command line tool that encapsulates the submission and management of cluster jobs behind a more user-friendly interface. The idea is to save anybody else from ever having to use Sun Grid Engine again. Thanks to a night in Belgium where it was far too warm to sleep, and a little Django magic, sunblock acquired a super-user-friendly interface and database backend.

Blog

This pain in the arse blog.


tl;dr

  • I’m still alive
  • I’m still working
  • Blogs are hard work

  1. Yes, sorry, it’s another -ome. I’m hoping it won’t find its way on to Jonathan Eisen’s list of #badomes
  2. Not to be confused with a series of tools invented by me, sadly. 
Sanger Sequel

In a change to scheduled programming, days after touching down from my holiday (which needs a post of its own) I moved1 to spend the next few weeks back at the Wellcome Trust Sanger Institute in Cambridgeshire. I interned here previously in 2012 and it’s still like working at a science-orientated Google, thanks to the overwhelming amount of work being done and the crippling inferiority complex that comes from being surrounded by internationally renowned scientists. Though at least I’m not acquiring significant mass from free food.

My aims here are two-fold and outlined below. Though of course, I’m still on the books back at Aberystwyth, and it would be both naughty and cruel of me to leave my newly acquired data cold, alone and untouched until I get back.

Aims

(1) Produce a Sequel

My undergraduate dissertation was titled Application of Machine Learning Techniques to Next Generation Sequencing Quality Control, and was conducted in collaboration with some colleagues from my previous placement at the Sanger Institute2. The project was to build a machine learning framework capable of improving the detection of “bad” samples by first characterising what it means to be a bad sample.

In short, the idea was to repeatedly push a large number of samples (each known to have individually passed or failed some internal quality control mechanism) through some analysis pipeline, holding out a single sample from the analysis in turn. The difference to a known result would then be calculated, and samples re-classified as good or bad based on whether the accuracy of a particular run increased or decreased in their absence.
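
In pipeline terms this was a leave-one-out loop; a minimal shell sketch, with entirely hypothetical script names, might look like:

# leave-one-out: re-run the analysis once per sample, excluding it
for bam in samples/*.bam; do
    run_analysis --exclude "$bam" --out results/$(basename "$bam" .bam)
done
# compare each run's output to the known result; samples whose absence
# improved accuracy get re-classified as bad, and vice versa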

Ultimately the scope was too large and the tools too fragile to complete the end-goal in the time that I had (though the project still achieved 90% and won an award, so one can’t complain too much), but we still have the data, and while I am here it would be interesting to try to pick up where we left off. I expect to do battle with the following tasks over the next few days:

Recall in detail what we were doing and figure out how far we got
i.e. Dig out the thesis, draw some diagrams and run ls everywhere.

Confirm the Goldilocks region
Due in part to the short time that I had to complete this project the first time around — a constraint I still have — I authored a tool named Goldilocks to “narrow down” my analysis from a whole genome to just a 1Mbp window. It would be worth ensuring the latest version of Goldilocks (which has long fixed some bugs I would really like to forget) still returns the same results as it did when I was doing my thesis.

Confirm data integrity
The data has been sat around in the way of actual in-progress science for the best part of a year, and has possibly been moved or “tidied away”. It would be worth ensuring all the files are actually intact and, for the sake of completeness, revisiting how those files came to be and regenerating them. This will encompass ensuring the Goldilocks region for each sample was correctly extracted. I recall the samples were made up of two studies and that we may have decided not to pursue one of them due to differences in sequencing3. I also recall having some major trouble with needing to re-align the failed samples to a different reference: these samples, having failed, were not subjected to all the processing of their QC-approved counterparts, which we’ll need to apply ourselves manually, presumably painstakingly.

Prepare data for the pipeline
The nail in the coffin for the first stab at this project was the data preparation: samtools merge was just woefully slow in handling the scale of data that I had, in particular struggling to merge many thousands of files at once. A significant amount of project time was spent tracking down and patching memory leaks and contributing other functionality (more on this in a moment), which left me with little time at the end to actually push the data through the pipeline and get results. samtools has undergone some rapid improvements since, and I suspect this step will no longer pose such a hurdle.

(2) Contribute to samtools

As I briefly alluded to, during the course of my undergraduate dissertation I authored several pull requests to a popular open-source bioinformatics toolkit known as samtools, which was initially created, and continues to be maintained, right here at the Sanger Institute. In particular, these pull requests improved documentation and patched some memory leaks for samtools merge, and also added naive header parsing, allowing input file metadata to be organised into basic structures for much more efficient iterative access later; significantly improving the time performance of samtools merge.

Header parsing has been a long sought-after feature for samtools, but none of the core maintainers had the time to put aside to take a good look at the RFC I had submitted. Now that I’m in-house and have put a face to a username, catching the most recent samtools steering meeting off-guard, I’ve been tasked with trying to get this done before I leave at the end of July.

No pressure.


tl;dr

  • I live in Cambridge until the end of July, please don’t try and find me in my office.4

  1. For what is at least the 10th major house move I’ve made since heading out to university. 
  2. As soon as I had my foot in the door I refused to take it away. 
  3. The sequencing was conducted at different depths between the two standalone studies and it was suspected this may introduce some bias I didn’t want to deal with. 
  4. Not that I’m ever in there, anyway. 