Duplicate definition error with GATK PrintReads and MalformedReadFilter

   Sam Nicholls    Tools

This afternoon I wanted to quickly check[1] whether some reads in a BAM would be filtered out by the GATK MalformedReadFilter. As you can’t invoke the filter alone, I figured one of the quickest ways to do this would be to utilise GATK PrintReads, which pretty much parses and spits out input BAMs, while also allowing one to specify filters and the like to be applied to the parser as it dutifully goes about its job of taking up all your cluster’s memory. I entered the command, taking care to specify MalformedRead for the -rf read filter option, feeling particularly pleased with myself for finally being capable of using a GATK command from memory.
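For illustration, a command of that shape might look like the sketch below, assuming GATK 3.x style syntax. The jar name, reference and BAM paths are placeholders made up for this example; only PrintReads and -rf MalformedRead come from the text above. The sketch merely assembles and prints the command line rather than running GATK.

```shell
# Placeholder inputs: these names are assumptions, not taken from the post.
GATK_JAR="GenomeAnalysisTK.jar"
REF="reference.fa"
IN_BAM="input.bam"
OUT_BAM="printed.bam"

# Print a GATK 3.x style PrintReads command line that explicitly requests
# the MalformedRead filter via -rf (the redundant invocation described above).
redundant_printreads_cmd() {
    echo "java -jar $GATK_JAR -T PrintReads -R $REF -I $IN_BAM -o $OUT_BAM -rf MalformedRead"
}

redundant_printreads_cmd
```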

GATK, wanting to teach me a lesson for not consulting documentation, quickly dumped a stack trace to my terminal and wiped the smile off my face.

At this point I felt somewhat hopeless: I was actually trying to use the MalformedReadFilter to debug something else, and now I was stuck two errors deep, surrounded by more Java than I could stomach. Before having a full breakdown about whether bioinformatics really is broken, I remembered that I am a little familiar with the filter in question. Indeed, I recognised the filter_reads_with_N_cigar argument from the error as one that can be supplied to the MalformedReadFilter itself. This seemed a little odd: where could it be getting a duplicate definition from?

Of course, from my own blog post and the PrintReads manual page, I should have recalled that the MalformedReadFilter is automatically applied by PrintReads. Specifying the same filter on top with -rf apparently causes something of a parsing upset. So there you have it: if you want to check whether your reads will be discarded by the MalformedReadFilter, you can just use PrintReads without any -rf option.
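A minimal sketch of that check, again assuming GATK 3.x style syntax and with every path a made-up placeholder: run PrintReads with no -rf option, then compare read counts before and after. The dropped_reads helper is hypothetical, and the difference only reflects the MalformedReadFilter if nothing else in your setup removes reads. As before, the command is printed rather than executed.

```shell
# Placeholder paths (assumptions for illustration only).
GATK_JAR="GenomeAnalysisTK.jar"
REF="reference.fa"

# Print a GATK 3.x style PrintReads command with no -rf option, since
# PrintReads applies the MalformedReadFilter without being asked.
# $1 is the input BAM, $2 is the output BAM.
printreads_cmd() {
    echo "java -jar $GATK_JAR -T PrintReads -R $REF -I $1 -o $2"
}

# Given read counts before ($1) and after ($2) PrintReads,
# print how many reads were dropped.
dropped_reads() {
    echo $(( $1 - $2 ))
}

printreads_cmd input.bam printed.bam

# Once the printed command has been run, the counts could come from samtools:
#   dropped_reads "$(samtools view -c input.bam)" "$(samtools view -c printed.bam)"
dropped_reads 1000 997   # made-up counts
```

With the made-up counts above, the helper reports 3 dropped reads; real numbers would of course depend on your BAM.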

tl;dr

  • GATK PrintReads applies the MalformedReadFilter automatically
  • Specifying -rf MalformedRead to PrintReads is not only redundant but problematic
  • Always read the fucking manual
  • Read your own damn blog
  • GATK is unforgiving

  1. It’s about time I realised that in bioinformatics, nobody has ever successfully “quickly checked” anything.