Grokking GATK: Common Pitfalls with the Genome Analysis Tool Kit (and Picard)

Sam Nicholls

The Genome Analysis Tool Kit (“the” GATK) is a big part of our pipeline here. Recently I’ve been following the DNASeq Best Practice Pipeline for my limpet sequence data. Here are some of the mistakes I made and how I made them go away.

Input file extension pedanticism

Invalid command line: The GATK reads argument (-I, --input_file) supports only BAM/CRAM files with the .bam/.cram extension

Starting small: this was a simple oversight on my part. My naming script had made a mistake, but I knew the files were BAM, so I ignored the issue and continued with the pipeline anyway. GATK, however, was not impressed and aborted immediately. A minor annoyance (the error even acknowledges the input appears to be BAM) but a trivial fix.
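In my case the cure was nothing more exotic than a rename (file name invented for illustration):

    mv limpet_sample.sorted limpet_sample.sorted.bam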

A sequence dictionary (and index) is compulsory for use of a FASTA reference

Fasta dict file <ref>.dict for reference <ref>.fa does not exist. Please see http://gatkforums.broadinstitute.org/discussion/1601/how-can-i-prepare-a-fasta-file-to-use-as-reference for help creating it.

Unmentioned in the documentation for the RealignerTargetCreator tool I was using, a sequence dictionary for the reference FASTA must be built and present in the same directory. The error kindly refers you to a help article on how one can achieve this with Picard, and indeed the process is simple:
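Something like the following does the trick (jar and file locations are illustrative, and the exact invocation depends on your Picard version):

    # Build the SAM-header-like sequence dictionary with Picard
    java -jar picard.jar CreateSequenceDictionary R=reference.fa O=reference.dict

    # GATK also wants a FASTA index alongside the reference
    samtools faidx reference.fa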

Though, I am somewhat confused as to what exactly a .dict file provides GATK over a FASTA index .fai (which is also required). Both files include the name and length of each contig in the reference, but the FASTA index also includes the positional offsets vital to enabling fast random access. The only additional information in the SAM-header-like sequence dictionary appears to be an MD5 hash of each sequence, which doesn’t seem overly useful in this scenario. I guess the .dict adds a layer of protection if GATK uses the hash as a sanity check, ensuring the loaded reference matches the one for which the index and dictionary were constructed.

You forgot to index your intermediate BAM

Invalid command line: Cannot process the provided BAM file(s) because they were not indexed. The GATK does offer limited processing of unindexed BAMs in --unsafe mode, but this GATK feature is currently unsupported.

Another frequently occurring issue caused by user forgetfulness. Following the best practice pipeline, one generates many “intermediate” BAMs, each of which must be indexed for efficient use during the following step; otherwise GATK will be disappointed with your lack of attention to detail and refuse to do any work for you.
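A quick trip through samtools after each step keeps GATK happy (file name illustrative):

    samtools index intermediate.bam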

Edit (13 Nov): A helpful reddit comment from a Picard contributor recommended setting CREATE_INDEX=true when using Picard, so that an index of your newly output BAM is created automatically. Handy!
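For example, tacked onto a MarkDuplicates invocation (tool choice and file names are just for illustration):

    java -jar picard.jar MarkDuplicates I=input.bam O=marked.bam M=metrics.txt \
        CREATE_INDEX=true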

Your temporary directory is probably too small

Unable to create a temporary BAM schedule file. Please make sure Java can write to the default temp directory or use -Djava.io.tmpdir= to instruct it to use a different temp directory instead.

GATK appears to love creating hundreds of thousands of small bamschedule.* files which, according to a glance at some relevant-looking GATK source, handle multithreaded merging of large BAM files. So numerous are these files that their presence totalled my limited temporary space. This was especially frustrating given the job had run for several hours, blissfully unaware that there are only so many things you can store in a shoebox. To avoid such disaster, inform Java of a more suitable location to store junk:
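Something along these lines, with the scratch path, jar location and tool arguments standing in for your own:

    # Point Java's temporary directory at somewhere roomy (path illustrative)
    java -Djava.io.tmpdir=/scratch/$USER/tmp -jar GenomeAnalysisTK.jar \
        -T RealignerTargetCreator -R reference.fa -I input.bam -o targets.intervals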

On rare occasions, you may also encounter permission errors when writing to a temporary directory. Specifying java.io.tmpdir as above will free you of these woes too.

You may have too many files and not enough file handles

Picard and GATK try to store some number of reads (or other plentiful metadata) in RAM during the parsing and handling of BAM files. When this limit is exceeded, reads are spilled to disk. Both Picard and GATK appear to keep the file handles for these spilled reads open simultaneously, presumably for fast access. But your executing user is likely limited to carrying only so many handles before becoming over-encumbered and falling to the ground, throwing an exception being the only option:

Exception in thread "main" htsjdk.samtools.SAMException: […].tmp not found
[…]
Caused by: java.io.FileNotFoundException: […].tmp (Too many open files)

In my case, I encountered this error when using Picard MarkDuplicates, which has a default maximum number of file handles [1]. This ceiling happened to be higher than that of the system itself. The fix in this case is trivial: use ulimit -n to determine the number of files your system will permit you to have a handle on at once, and inform MarkDuplicates using the MAX_FILE_HANDLES_FOR_READ_ENDS_MAP parameter:
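A sketch of the procedure, with file names invented for illustration and a limit picked to sit just below the system's:

    # Ask the system how many open files you're allowed
    ulimit -n    # e.g. 1024

    # Tell MarkDuplicates to stay a little below that ceiling
    java -jar picard.jar MarkDuplicates I=input.bam O=marked.bam M=metrics.txt \
        MAX_FILE_HANDLES_FOR_READ_ENDS_MAP=1000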

This is somewhat counter-intuitive: the error is caused by an acute overabundance of file handles, yet my suggested fix is to permit even fewer handles? In this case at least, it appears Picard compensates by creating fewer, larger spill files. You’ll notice I didn’t use the exact value of ulimit -n in the argument; it’s likely there’ll be a few other file handles open here and there (your input, output and metrics files, at least) and with the full value you’d stumble across the same error once more.

From a little searching, it appears that for the most part GATK will open as many files as it wants, and if that number is greater than ulimit -n, it will throw a tantrum. Unfortunately, you’re out of luck here for solving the problem on your own: non-administrative users cannot raise the hard limit on the number of file handles they are permitted to have open, so you’ll need to befriend your system administrator and kindly request that it be raised before continuing. Though, the same link does suggest that lowering the number of GATK execution threads can potentially alleviate the issue in some cases.
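For what it’s worth, you can at least see where you stand, and raise your own soft limit as far as the hard ceiling allows:

    ulimit -Sn        # current soft limit on open file handles
    ulimit -Hn        # hard ceiling; raising this requires an administrator
    ulimit -n 4096    # raise the soft limit, up to the hard limit at most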

Your maximum Java heap is also too small

There was a failure because you did not provide enough memory to run this program.  See the -Xmx JVM argument to adjust the maximum heap size provided to Java

GATK has an eating problem; it has no self-restraint when memory is on the table. I’m not sure whether GATK was brought up with many siblings that had to fight for food, but it certainly doesn’t help that it is implemented in Java, a language not particularly known for its memory efficiency. When invoked, Java will allocate a heap on which to pile the many objects it wants to keep around, with a typical default maximum size of around 1GB. It’s not enough to just tell your job scheduler that you need all of the RAM; you also need to let Java know that it is welcome to expand the heap for dumping genomes beyond the default maximum. Luckily, this is quite simple:
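The heap size below is illustrative; match it to whatever you requested from your scheduler:

    # Permit the JVM a maximum heap of 16GB
    java -Xmx16g -jar GenomeAnalysisTK.jar [...]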

The MalformedReadFilter has a looser definition of malformed than expected

I’ve previously touched on the discovery that the GATK MalformedReadFilter is much more aggressive than its documentation lets on. The lovely GATK developers have even opened an issue about it after I reported it in their forum.


tl;dr

  • Your BAM files should end in .bam
  • Any FASTA based reference needs both an index (.fai) and dictionary (.dict)
  • Be indexing, always
  • pysam is a pretty nice package for dealing with SAM/BAM files in Python
  • Your temp dir is too small, specify -Djava.io.tmpdir=/path/to/big/disk/ to java when invoking GATK
  • Picard may generously overestimate the number of file handles available
  • GATK is a spoilt child and will have as many file handles as it wants
  • Apply more memory to GATK with java -Xmx<int>g to avoid running out of heap
  • Remember, the MalformedReadFilter is rather aggressive
  • You need a bigger computer

  [1] At the time of writing, 8000.