This afternoon I wanted to quickly check[1] whether some reads in a BAM would be filtered out by the GATK MalformedReadFilter. As you can’t invoke the filter alone, I figured one of the quickest ways to do this would be to use GATK PrintReads, which pretty much parses and spits out input BAMs, while also letting you specify filters and the like to be applied to the parser as it dutifully goes about its job of taking up all your cluster’s memory. I entered the command, taking care to specify MalformedRead for the -rf read filter option, feeling particularly pleased with myself for finally being able to use a GATK command from memory:

    java -jar GenomeAnalysisTK.jar -T PrintReads -rf MalformedRead -I <INPUT> -R <REFERENCE>

GATK, wanting to teach me a lesson for not consulting documentation, quickly dumped a stack trace to my terminal and wiped the smile off my face.

    ##### ERROR ------------------------------------------------------------------------------------------
    ##### ERROR stack trace
    org.broadinstitute.gatk.utils.exceptions.ReviewedGATKException: Duplicate definition of argument with full name: filter_reads_with_N_cigar
        at org.broadinstitute.gatk.utils.commandline.ArgumentDefinitions.add(ArgumentDefinitions.java:59)
        at org.broadinstitute.gatk.utils.commandline.ParsingEngine.addArgumentSource(ParsingEngine.java:150)
        at org.broadinstitute.gatk.utils.commandline.CommandLineProgram.start(CommandLineProgram.java:207)
        at org.broadinstitute.gatk.utils.commandline.CommandLineProgram.start(CommandLineProgram.java:155)
        at org.broadinstitute.gatk.engine.CommandLineGATK.main(CommandLineGATK.java:106)
    ##### ERROR ------------------------------------------------------------------------------------------
    ##### ERROR A GATK RUNTIME ERROR has occurred (version 3.5-0-g36282e4):
    ##### ERROR
    ##### ERROR This might be a bug. Please check the documentation guide to see if this is a known problem.
    ##### ERROR If not, please post the error message, with stack trace, to the GATK forum.
    ##### ERROR Visit our website and forum for extensive documentation and answers to
    ##### ERROR commonly asked questions http://www.broadinstitute.org/gatk
    ##### ERROR
    ##### ERROR MESSAGE: Duplicate definition of argument with full name: filter_reads_with_N_cigar
    ##### ERROR ------------------------------------------------------------------------------------------

At this point I felt somewhat hopeless. I was actually trying to use the MalformedReadFilter to debug something else, and now I was stuck two errors deep, surrounded by more Java than I could stomach. Before having a full breakdown about whether bioinformatics really is broken, I remembered that I am a little familiar with the filter in question. Indeed, I recognised the filter_reads_with_N_cigar argument from the error as one that can be supplied to the MalformedReadFilter itself. This seemed a little odd: where could it be getting a duplicate definition from?
Of course, from my own blog post and the PrintReads manual page, I should have recalled that the MalformedReadFilter is automatically applied by PrintReads. Specifying the same filter on top with -rf apparently causes somewhat of a parsing upset, presumably because the filter’s arguments end up defined twice. So there you have it: if you want to check whether your reads will be discarded by the MalformedReadFilter, you can just use PrintReads:

    java -jar GenomeAnalysisTK.jar -T PrintReads -I <INPUT> -R <REFERENCE>

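To put a number on how many reads were discarded, one rough option is to write the PrintReads output to a BAM and compare read counts before and after. The sketch below is only that: it assumes samtools is on your PATH, that the output BAM is named with -o, and that nothing other than the MalformedReadFilter is dropping reads. The names in.bam, out.bam and ref.fa are placeholders, and count_reads and dropped_reads are hypothetical helpers rather than anything GATK provides.

```shell
#!/usr/bin/env bash
# Sketch: estimate how many reads PrintReads dropped by comparing record counts.
# Assumes samtools is installed; all file names are placeholders.

# Print the number of records in a BAM (samtools view -c prints a count only).
count_reads() {
    samtools view -c "$1"
}

# Print the difference between a "before" and an "after" read count.
dropped_reads() {
    echo $(( $1 - $2 ))
}

# Usage (hypothetical file names):
#   java -jar GenomeAnalysisTK.jar -T PrintReads -I in.bam -R ref.fa -o out.bam
#   dropped_reads "$(count_reads in.bam)" "$(count_reads out.bam)"
```

A result of 0 would suggest that no reads were filtered, under the assumptions above.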
tl;dr
- GATK PrintReads applies the MalformedReadFilter automatically
- Specifying -rf MalformedRead to PrintReads is not only redundant but problematic
- Always read the fucking manual
- Read your own damn blog
- GATK is unforgiving

[1] It’s about time I realised that in bioinformatics, nobody has ever successfully “quickly checked” anything.