Tools – Samposium (samnicholls.net)

bowtie2: Relaxed Parameters for Generous Alignments to Metagenomes
Sat, 24 Dec 2016 | https://samnicholls.net/2016/12/24/bowtie2-metagenomes/

In a change to my usual essay length posts, I wanted to share a quick bowtie2 tip for relaxing the parameters of alignment. It's no big secret that bowtie2 has these options, and there's some pretty good guidance in the manual, too. However, we've had significant trouble in our lab finding a suitable set of permissive alignment parameters.

In the course of my PhD work on haplotyping regions of metagenomes, I have found that even with bowtie2's somewhat permissive --very-sensitive-local preset, sequences with less than 90% identity to the reference are significantly less likely to align back to it. This is problematic in my line of work, where I wish to recover all of the individual variants of a gene: the basis of my approach is a set of short reads (50-250bp) aligned to a position on a metagenomic assembly (which I term the pseudo-reference). It's important to note that I am not interested in assembling individual genomes from metagenomic reads, but in recovering the genes themselves.

Recently, the opportunity arose to provide some evidence for this. I have some datasets that constitute "synthetic metahaplomes": each consists of a handful of arbitrary known genes that all perform the same function, each from a different organism. These genes can be broken up into synthetic reads and aligned to some common reference (another gene in the same family).

This alignment can be used as a means to test my metagenomic haplotyper, Gretel (and her novel brother data structure, Hansel), by attempting to recover the original input sequences from these synthetic reads. I've already reported in my pre-print that our method is at the mercy of the preceding alignment, and used this as the hypothesis for a poor recovery in one of our data sets.

Indeed, as part of my latest experiments, I have generated some coverage heat maps showing the average coverage of each haplotype (Y-axis) at each position of the pseudo-reference (X-axis), and I've found that for sequences that fall below roughly 90% identity, --very-sensitive-local becomes unsuitable.
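
For what it's worth, each row of such a heatmap boils down to a per-position depth profile, which is easy to compute with pysam. Below is a minimal sketch, assuming one sorted and indexed BAM per haplotype aligned back to the pseudo-reference; the file and contig names are placeholders of my own.

import pysam

# Per-position depth of one haplotype's reads on the pseudo-reference.
def haplotype_depths(bam_path, contig):
    with pysam.AlignmentFile(bam_path) as bam:
        # count_coverage returns four arrays (A, C, G, T counts per position)
        a, c, g, t = bam.count_coverage(contig)
        return [sum(base) for base in zip(a, c, g, t)]

depths = haplotype_depths("haplotype_01.vs_pseudoref.bam", "pseudo_reference")
print(sum(depths) / len(depths))  # average coverage: one row of the heatmap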

The BLAST record below represents the alignment that corresponds to the gene whose reads go on to align at the average coverage depicted at the top bar of the above heatmap. Despite its 79% identity, it looks good(TM) to me, and I need sequences of this level of diversity to align to my pseudo-reference so they can be included in Gretel's analysis. I need generous alignment parameters that permit even quite diverse reads (but hopefully not so diverse that they no longer belong to a gene of the same family) to map back to my reference. Otherwise Gretel will simply miss these haplotypes.

So despite having already spent many days of my PhD repeatedly failing to increase my overall alignment rates for my metagenomes, I felt this time it would be different. I had a method (my heatmap) to see how my alignment parameters affected the alignment rates of reads on a per-haplotype basis. It’s also taken until now for me to quantify just what sort of sequences we are missing out on, courtesy of dropped reads.

I was determined to get this right.

For a change, I’ll save you the anticipation and tell you what I settled on after about 36 hours of getting cross.

  • --local -D 20 -R 3
    Ensure we’re not performing end-to-end alignment (allow for soft clipping and the like), and borrow the most sensitive default “effort” parameters.
  • -L 3
    The seed substring length. Decreasing this from the default (20 to 25, depending on the preset) to just 3 allows for a much more aggressive alignment, but adds computational cost. I actually had reasonably good results with -L 11, which might suit you if you have a much larger data set but still need to relax the aligner.
  • -N 1
    Permit a mismatch in the seed, because why not?
  • --gbar 1
    Disallows gaps within 1 position of either end of a read (down from the default of 4). Has a small but noticeable effect; it appears to thin the width of some of the coverage gaps in the heatmap at the most stubborn sites.
  • --mp 4
    Reduces the maximum penalty that can be applied to a strongly supported (high quality) mismatch by a third (from the default value of 6). The aggregate sum of these penalties is what causes reads to be dropped. Along with the seed substring length, this had a significant influence on increasing my alignment rates. If your coverage stains are stubborn, you could decrease this again.

Tada.


tl;dr

  • bowtie2 --local -D 20 -R 3 -L 3 -N 1 -p 8 --gbar 1 --mp 3

Bioinformatics is a disorganised disaster and I am too. So I made a shell.
Wed, 16 Nov 2016 | https://samnicholls.net/2016/11/16/disorganised-disaster/

If you don't want to hear me wax lyrical about how disorganised I am, you can skip ahead to where I tell you about how great the pseudo-shell that I made and named chitin is.

Back in 2014, about half way through my undergraduate dissertation (Application of Machine Learning Techniques to Next Generation Sequencing Quality Control), I made an unsettling discovery.

I am disorganised.

The discovery was made after my supervisor asked a few interesting questions regarding some of my earlier discarded analyses. When I returned to the data to try and answer those questions, I found I simply could not regenerate the results. Despite the fact that both the code and each “experiment” were tracked by a git repository and I’d written my programs to output (what I thought to be) reasonable logs, I still could not reproduce my science. It could have been anything: an ad-hoc, temporary tweak to a harness script, a bug fix in the code itself masking a result, or any number of other possible untracked changes to the inputs or program parameters. In general, it was clear that I had failed to collect all pertinent metadata for an experiment.

Whilst it perhaps sounds like I was guilty of negligent book-keeping, it really wasn’t for lack of trying. Yet when dealing with many interesting questions at once, it’s so easy to make ad-hoc changes, or perform undocumented command line based munging of input data, or accidentally run a new experiment that clobbers something. Occasionally, one just forgets to make a note of something, or assumes a change is temporary but for one reason or another, the change becomes permanent without explanation. These subtle pipeline alterations are easily made all the time, and can silently invalidate swathes of results generated before (and/or after) them.

Ultimately, for the purpose of reproducibility, almost everything (copies of inputs, outputs, logs, configurations) was dumped and tar‘d for each experiment. But this approach brought problems of its own: just tabulating results was difficult in its own right. In the end, I was pleased with that dissertation, but a small part of me still hurts when I think back to the problem of archiving and analysing those result sets.

It was a nightmare, and I promised it would never happen again.

Except it has.

A relapse of disorganisation

Two years later and I’ve continued to be capable of convincing a committee to allow me to progress towards adding the title of doctor to my bank account. As part of this quest, recently I was inspecting the results of a harness script responsible for generating trivial haplotypes, corresponding reads and attempting to recover them using Gretel. “Very interesting, but what will happen if I change the simulated read size”, I pondered; shortly before making an ad-hoc change to the harness script and inadvertently destroying the integrity of the results I had just finished inspecting by clobbering the input alignment file used as a parameter to Gretel.

Argh, not again.

Why is this hard?

Consider Gretel: she’s not just a simple standalone tool that one can execute to rescue haplotypes from the metagenome. One must go through the motions of pushing their raw reads through some form of pipeline (pictured below) to generate an alignment (to essentially give a co-ordinate system to those reads) and discover the variants (the positions in that co-ordinate system that relate to polymorphisms on reads) that form the required inputs for the recovery algorithm first.

This is problematic for one who wishes to be aware of the provenance of all outputs of Gretel, as those outputs depend not only on the immediate inputs (the alignment and called variants), but on the entirety of the pipeline that produced them. Thus we must capture as much information as possible about all of the steps that occur from the moment the raw reads hit the disk, up to Gretel finishing with extracted haplotypes.

But as I described in my last status report, these tools are themselves non-trivial. bowtie2 has more switches than an average spaceship, and its output depends on its complex set of parameters and inputs (that also have dependencies on previous commands), too.

[Image: the pipeline that turns raw reads into Gretel's inputs]

bash scripts are all well and good for keeping track of a series of commands that yield the result of an experiment, and one can create a nice new directory in which to place such a result at the end – along with any log files and a copy of the harness script itself for good measure. But what happens when future experiments use different pipeline components, with different parameters, or we alter the generation of log files to make way for other metadata? What's a good directory naming strategy for archiving results anyway? What if parts (or even all) of the analysis are ad-hoc and we are left to reconstruct the history? How many times have you made a manual edit to a malformed file, or had to look up exactly what combination of sed, awk and grep munging you did that one time?

One would have expected me to have learned my lesson by now, but I think meticulous digital lab book-keeping is just not that easy.

What does organisation even mean anyway?

I think the problem is perhaps exacerbated by conflating the meaning of “organisation”. There are a few somewhat different, but ultimately overlapping problems here:

  • How to keep track of how files are created
    What command created file foo? What were the parameters? When was it executed, by whom?
  • Be aware of the role that each file plays in your pipeline
    What commands go on to use file foo? Is it still needed?
  • Assure the ongoing integrity of past and future results
    Does this alignment have reads? Is that FASTA index up to date?
    Are we about to clobber shared inputs (large BAMS, references) that results depend on?
  • Archiving results in a sensible fashion for future recall and comparison
    How can we make it easy to find and analyse results in future?

Indeed, my previous attempts at organisation address some but not all of these points, which is likely the source of my bad feeling. Keeping hold of bash scripts can help me determine how files are created, and the role those files go on to play in the pipeline; but results are merely dumped in a directory. Such directories are created with good intent, and named something that was likely useful and meaningful at the time. Unfortunately, I find that these directories become less and less useful as archive labels as time goes on… For example, what the fuck is ../5-virus-mix/2016-10-11__ref896__reg2084-5083__sd100/1?

This approach also had no way to assure the current and future integrity of my results. Last month I had an issue with Gretel outputting bizarrely formatted haplotype FASTAs. After chasing my tail trying to find a bug in my FASTA I/O handling, I discovered this was actually caused by an out-of-date FASTA index (.fai) on the master reference. At some point I'd exchanged one FASTA for another, assuming that the index would be regenerated automatically. It wasn't. Thus the integrity of experiments using that combination of FASTA+index was damaged. Additionally, the integrity of the results generated using the old FASTA was now also damaged: I'd clobbered the old master input.

There is a clear need to keep better metadata for files, executed commands and results, beyond just tracking everything with git. We need a better way to document the changes a command makes in the file system, and a mechanism to better assure integrity. Finally we need a method to archive experimental results in a more friendly way than a time-sensitive graveyard of timestamps, acronyms and abbreviations.

So I've taken it upon myself to get distracted from my PhD to embark on a new adventure to save myself from ruining my PhD [2], and fix bioinformatics for everyone.

Approaches for automated command collection

Taking the number of post-its attached to my computer and my sporadically used notebooks as evidence enough to outright skip over the suggestion of a paper based solution to these problems, I see two schools of thought for capturing commands and metadata computationally:

  • Intrusive, but data is structured with perfect recall
    A method whereby users must execute commands via some sort of wrapper. All commands must have some form of template that describes inputs, parameters and outputs. The wrapper then “fills in” the options and dispatches the command on the user’s behalf. All captured metadata has uniform structure and nicely avoids the need to attempt to parse user input. Command reconstruction is perfect but usage is arguably clunky.
  • Unobtrusive, best-effort data collection
    A daemon-like tool that attempts to collect executed commands from the user’s shell and monitor directories for file activity. Parsing command parameters and inputs is done in a naive best-effort scenario. The context of parsed commands and parameters is unknown; we don’t know what a particular command does, and cannot immediately discern between inputs, outputs, flags and arguments. But, despite the lack of structured data, the user does not notice our presence.

There is a trade-off between usability and data quality here. If we sit between a user and all of their commands, offering a uniform interface to execute any piece of software, we can obtain perfectly structured information and are explicitly aware of parameter selections and the paths of all inputs and desired outputs. We know exactly where to monitor for file system changes, and can offer user interfaces that not only merely enumerate command executions, but offer searching and filtering capabilities based on captured parameters: “Show me assemblies that used a k-mer size of 31”.

But we must ask ourselves, how much is that fine-grained data worth to us? Is exchanging our ability to execute commands ourselves worth the perfectly structured data we can get via the wrapper? How many of those parameters are actually useful? Will I ever need to find all my bowtie2 alignments that used 16 threads? There are other concerns here too: templates that define a job specification must be maintained. Someone must be responsible for adding new (or removing old) parameters to these templates when tools are updated. What if somebody happens to misconfigure such a template? More advanced users may be frustrated at being unable to merely execute their job on the command line. Less advanced users could be upset that they can't just copy and paste commands from the manual or Biostars. What about smaller jobs? Must one really define a command template to run trivial tools like awk, sed, tail, or samtools sort through the wrapper?

It turns out I know the answer to this already: the trade-off is not worth it.

Intrusive wrappers don’t work: a sidenote on sunblock

Without wanting to bloat this post unnecessarily, I want to briefly discuss a tool I've written previously, but first I must set the scene [3].

Within weeks of starting my PhD, I made a computational enemy in the form of Sun Grid Engine: the scheduler software responsible for queuing, dispatching, executing and reporting on jobs submitted to the institute’s cluster. I rapidly became frustrated with having an unorganised collection of job scripts, with ad-hoc edits that meant I could no longer re-run a job previously executed with the same submission script (does this problem sound familiar?). In particular, I was upset with the state of the tools provided by SGE for reporting on the status of jobs.

To cheer myself up, I authored a tool called sunblock, with the goal of never having to look at any component of Sun Grid Engine directly ever again. I was successful in my endeavour and to this day continue to use the tool on the occasion where I need to use the cluster.

[Screenshot: sunblock in use]

However, as hypothesised above, sunblock does indeed require an explicit description of an interface for any job that one would wish to submit to the cluster, and it does prevent users from just pasting commands into their terminal. This all-encompassing wrapping feature, which allows us to capture the best structured information on every job, is also the tool's complete downfall. Despite the useful information that could be extracted using sunblock (there is even a shiny sunblock web interface), its ability to automatically re-run jobs, and its superior reporting on job progress compared to SGE alone, it was still not enough to gain user traction in our institute.

For the same reason that I think more in-the-know bioinformaticians don’t want to use Galaxy, sunblock failed: because it gets in the way.

Introducing chitin: an awful shell for awful bioinformaticians

Taking what I learned from my experimentation with sunblock on board, I elected to take the less intrusive, best-effort route to collecting user commands and file system changes. Thus I introduce chitin: a Python-based tool that (somewhat) unobtrusively wraps your system shell, keeping track of commands and file manipulations to address the problem of not knowing how any of the files in your ridiculously complicated bioinformatics pipeline came to be.

I initially began the project with a view to creating a digital lab book manager. I envisaged offering a command line tool with several subcommands, one of which could take a command for execution. However, as soon as I tried out my prototype and found myself prepending the majority of my commands with lab execute, I wondered whether I could do better. What if I just wrapped the system shell and captured all entered commands? This might seem a rather dumb and roundabout way of getting one's command history, but consider this: if we wrap the system shell as a means to capture all the input, we are also in a position to capture the output for clever things, too. Imagine a shell that could parse the stdout for useful metadata to tag files with…
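
To make the idea concrete, here is a minimal sketch of that kind of wrapper. It is an illustration of the concept rather than chitin's actual code: read a command, run it through the system shell, and diff a snapshot of the working directory to see which files the command created or modified.

import os
import subprocess
import time

# A toy command-capturing wrapper (an illustration of the idea, not chitin
# itself): snapshot the working directory, run the command via the system
# shell, snapshot again, and record what changed alongside the output.

def snapshot(path="."):
    """Map every file under path to its (size, mtime)."""
    state = {}
    for root, _, files in os.walk(path):
        for name in files:
            full = os.path.join(root, name)
            st = os.stat(full)
            state[full] = (st.st_size, st.st_mtime)
    return state

history = []
while True:
    try:
        cmd = input("chitin-ish> ").strip()
    except EOFError:
        break
    if not cmd:
        continue
    if cmd in ("exit", "quit"):
        break

    before = snapshot()
    start = time.time()
    result = subprocess.run(cmd, shell=True, capture_output=True, text=True)
    after = snapshot()

    changed = sorted(f for f in after if after[f] != before.get(f))
    history.append({
        "cmd": cmd,
        "user": os.environ.get("USER", "unknown"),
        "runtime": time.time() - start,
        "changed": changed,
        "stdout": result.stdout,  # kept around so handlers could parse it later
    })

    print(result.stdout, end="")
    print(result.stderr, end="")
    for f in changed:
        print("# noticed a change to: %s" % f)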

I liked what I was imagining, and so, despite my best efforts to get even just one person to convince me otherwise, I wrote my own pseudo-shell.

chitin is already able to track executed commands that yield changes to the file system. For each file in the chitin tree, there is a full modification history. Better yet, you can ask what series of commands needs to be executed in order to recreate a particular file in your workflow. It's also possible to tag files with potentially useful metadata, and so chitin takes advantage of this by adding the runtime [4] and current user to all executed commands for you.

Additionally, I’ve tried to find my own middle ground between the sunblock-esque configurations that yielded superior metadata, and not getting in the way of our users too much. So one may optionally specify handlers that can be applied to detected commands, and captured stdout/stderr. For example, thanks to my bowtie2 configuration, chitin tags my out.sam files with the overall alignment rate (and a few targeted parameters of interest), automatically.
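
As a flavour of what such a handler might look like (the handler interface shown here is hypothetical, but the regular expression matches the summary line bowtie2 writes to stderr):

import re

# Hypothetical chitin-style handler: pull the overall alignment rate out of
# bowtie2's stderr summary so it can be attached to the output SAM as metadata.
ALIGNMENT_RATE = re.compile(r"([\d.]+)% overall alignment rate")

def bowtie2_handler(stderr):
    match = ALIGNMENT_RATE.search(stderr)
    if match:
        return {"overall_alignment_rate": float(match.group(1))}
    return {}

# bowtie2_handler("...\n85.03% overall alignment rate\n")
# -> {'overall_alignment_rate': 85.03}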

[Screenshot: chitin tagging a SAM file with bowtie2's overall alignment rate]

chitin also allows you to specify handlers for particular file formats to be applied to files as they are encountered. My environment, for example, is set up to count the number of reads inside a BAM, and associate that metadata with that version of the file:

[Screenshot: chitin recording the number of reads in a BAM as metadata]

In this vein, we are in a nice position to check on the status of files before and after a command is executed. To address some of my integrity woes, chitin allows you to define integrity handlers for particular file formats too. Thus my environment warns me if a BAM has 0 reads, is missing an index, or has an index older than itself. Similarly, an empty VCF raises a warning, as does an out of date FASTA index. Coming shortly will be additional checks for whether you are about to clobber a file that is depended on by other files in your workflow. Kinda cool, even if I do say so myself.
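
Those particular checks are small enough to sketch; something along these lines (the function names are mine for illustration, not chitin's API):

import os
import pysam

# Sketch of the integrity checks described above: warn on empty BAMs,
# missing indexes, and indexes that are older than the file they index.
def check_bam(bam_path):
    warnings = []
    index_path = bam_path + ".bai"
    if not os.path.exists(index_path):
        warnings.append("missing index")
    elif os.path.getmtime(index_path) < os.path.getmtime(bam_path):
        warnings.append("index is older than the BAM")
    with pysam.AlignmentFile(bam_path) as bam:
        if bam.count(until_eof=True) == 0:
            warnings.append("BAM contains 0 reads")
    return warnings

def check_fasta(fasta_path):
    fai_path = fasta_path + ".fai"
    if os.path.exists(fai_path) and \
            os.path.getmtime(fai_path) < os.path.getmtime(fasta_path):
        return ["FASTA index is out of date"]
    return []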

Conclusion

Perhaps I’m trying to solve a problem of my own creation. Yet from a few conversations I’ve had with folks in my lab, and frankly, anyone I could get to listen to me for five minutes about managing bioinformatics pipelines, there seems to be sympathy to my cause. I’m not entirely convinced myself that a “shell” is the correct solution here, but it does seem to place us in the best position to get commands entered by the user, with the added bonus of getting stdout to parse for free. Though, judging by the flurry of Twitter activity on my dramatically posted chitin screenshots lately, I suspect I am not so alone in my disorganisation and there are at least a handful of bioinformaticians out there who think a shell isn’t the most terrible solution to this either. Perhaps I just need to be more of a wet-lab biologist.

Either way, I genuinely think there’s a lot of room to do cool stuff here, and to my surprise, I’m genuinely finding chitin quite useful already. If you’d like to try it out, the source for chitin is open and free on GitHub. Please don’t expect too much in the way of stability, though.


tl;dr

  • A definition of “being organised” for science and experimentation is hard to pin down
  • But independent of such a definition, I am terminally disorganised
  • Seriously what the fuck is ../5-virus-mix/2016-10-11__ref896__reg2084-5083__sd1001
  • I think command wrappers and platforms like Galaxy get in the way of things too much
  • I wrote a “shell” to try and compensate for this
  • Now I have a shell, it is called chitin

  1. This is a genuine directory in my file system, created about a month ago. It contains results for a run of Gretel against the pol gene on the HIV genome (2084-5083). Off the top of my head, I cannot recall what sd100 is, or why reg appears before the base positions. I honestly tried. 
  2. Because more things that are not my actual PhD is just what my PhD needs. 
  3. If it helps you, imagine some soft jazz playing to the sound of rain while I talk about this gruffly in the dark with a cigarette poking out of my mouth. Oh, and everything is in black and white. It’s bioinformatique noir
  4. I’m quite pleased with this one, because I pretty much always forget to time how long my assemblies and alignments take. 

Goldilocks: A tool for identifying genomic regions that are "just right"
Tue, 08 Mar 2016 | https://samnicholls.net/2016/03/08/goldilocks/

I'm published! I'm a real scientist now! Goldilocks, my Python package for locating regions on a genome that are "just right" (for some user-provided definition of just right) is published software and you can check out the application note on Bioinformatics Advance Access, download the tool with pip install goldilocks, view the source on Github and read the documentation on readthedocs.


How (not) to subset a BAM for GATK
Sun, 10 Jan 2016 | https://samnicholls.net/2016/01/10/not-bam-subset/

I wanted a BAM that contained reads aligned to just one of the many contigs the file contained. As usual, I made this much more difficult than it really ought to have been.

This post takes a little look at manually handling BAM files with pysam and perhaps why it was not a good idea for the use case in question. For those who really just want to subset a BAM without the lesson, skip ahead, or consult some appropriate documentation.

Wasting time with RealignerTargetCreator, large SQ headers and sparse BAMs

I began by first pulling out the reads associated with a specific contig of interest and writing them to a new BAM with pysam (an htslib interface wrapper for Python). For a header, I reused the original superset BAM's header by setting the template parameter of the new AlignmentFile constructor to the open "super" AlignmentFile:

import pysam

# Open original "super" BAM containing all reads
super_bam = pysam.AlignmentFile("/path/to/my.bam")

# Open a new BAM for writing, using the old header
INTERESTING_CONTIG = "my_contig"
contig_bam = pysam.AlignmentFile(INTERESTING_CONTIG+".lazy.bam", "wb", template=super_bam)

# Write reads on the target contig to new file
for read in super_bam.fetch(INTERESTING_CONTIG):
    contig_bam.write(read)

# Housekeeping
super_bam.close()
contig_bam.close()

Although the script had extracted a subset of reads on a given contig as desired, I found downstream that GATK was wasting resources – or more specifically, many hours of my cluster time – processing the hundreds of thousands of other contigs (@SQ lines) listed in the header, despite there being no reads on those contigs.

INFO  10:28:50,300 GenomeAnalysisEngine - Preparing for traversal over 1 BAM files 
INFO  10:28:51,282 GenomeAnalysisEngine - Done preparing for traversal 
INFO  10:28:51,283 ProgressMeter - [INITIALIZATION COMPLETE; STARTING PROCESSING] 
INFO  10:28:51,285 ProgressMeter -                 | processed |    time |    per 1M |           |   total | remaining 
INFO  10:28:51,285 ProgressMeter -        Location |     sites | elapsed |     sites | completed | runtime |   runtime 
INFO  10:29:21,290 ProgressMeter - NODE_558_length_1188_cov_10.499158:1201     83534.0    30.0 s       6.0 m        0.0%    42.6 h      42.6 h 
INFO  10:30:21,293 ProgressMeter - NODE_1558_length_375_cov_11.621333:401    243207.0    90.0 s       6.2 m        0.1%    44.4 h      44.4 h 
INFO  10:31:21,295 ProgressMeter - NODE_2578_length_1249_cov_7.622097:1201    379570.0     2.5 m       6.6 m        0.1%    47.4 h      47.3 h
[...]
INFO  10:22:06,562 ProgressMeter - NODE_1498866_length_119_cov_4.596639:101   4.32648722E8    47.9 h       6.6 m       99.9%    47.9 h       2.6 m 
INFO  10:23:06,574 ProgressMeter - NODE_1500560_length_51_cov_4.470588:101   4.32851674E8    47.9 h       6.6 m      100.0%    47.9 h      77.0 s 
INFO  10:24:06,584 ProgressMeter - NODE_1502759_length_114_cov_12.587719:101   4.33032727E8    47.9 h       6.6 m      100.0%    47.9 h       5.0 s 
INFO  10:24:11,379 ProgressMeter -            done   4.3304713E8    47.9 h       6.6 m      100.0%    47.9 h       0.0 s 
INFO  10:24:11,381 ProgressMeter - Total runtime 172520.10 secs, 2875.33 min, 47.92 hours

Indeed, for the unsure, we can confirm that a small subset of reads were successfully extracted, but the entire header remains.

$ samtools view -c Limpet-Magda.sorted.bam # The super BAM
365107681

$ samtools view -c NODE_912989_length_238_cov_5.743698.lazy.bam # The new subset BAM
98

$ samtools view -H NODE_912989_length_238_cov_5.743698.lazy.bam | grep -c "^@SQ"
730724

This amortizes to around one read per half hour, at which rate I could probably have done the job myself by hand. Evidently, we’d need to provide a smaller header of our own.

Invalidating reads with improper mates

I went back to my pysam script and stripped out all sequence (@SQ) lines from the resulting header that did not match the single contig of interest, taking care to now set the reference_id and next_reference_id (the read mate) of each read to 0: the first and only @SQ line in the new header, our target contig. For reads on the target contig whose mate was mapped elsewhere, I instead updated the next_reference_id to -1: i.e. unmapped. This happened to cause unexpected behaviour downstream, in that I was not expecting everything to be broken:

Exception in thread "main" htsjdk.samtools.SAMFormatException:
    SAM validation error: ERROR: Record 37, Read name <READ NAME>, Mate Alignment start should be 0 because reference name = *.

It wouldn't be until later, whilst investigating issues with another tool, that I would discover how to correctly update the bit flags and read attributes to mark reads (and their mates) as unmapped as per the BAM specification. But in this instance, the error made me question whether I really wanted to dispose of the information held by reads whose mate appeared on another contig. Figuring this could come in handy later for scaffolding (or just satisfying my curiosity), I needed to find another way to subset the BAM.
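
For the record, this is roughly what that looks like in pysam (a sketch of the general idea, not the exact fix I eventually used): the flag bit, the mate reference and the mate position all have to agree, otherwise validators complain as above.

# Sketch: consistently mark a read's mate as unmapped with pysam.
# `read` is a pysam.AlignedSegment.
def orphan_mate(read):
    read.mate_is_unmapped = True       # sets the 0x8 flag bit
    read.next_reference_id = -1        # RNEXT becomes "*"
    read.next_reference_start = -1     # PNEXT is written as 0, as the spec expects
    read.is_proper_pair = False        # the pair can no longer be "proper"
    return read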

Attempting to read a reference_id greater than the number of SQ lines unsurprisingly causes a samtools segmentation fault

I returned to my hacky script once more. This time, my header was constructed such that it would contain @SQ sequence lines not only for the target contig, but also for any contig on which a mate of a read from the target contig appears. I did this by discarding the sequence lines that were neither the target contig nor home to a mate of any read on the target contig:

import pysam

super_bam = pysam.AlignmentFile("/path/to/my.bam")

# Define target contig and fetch its index in the SQ lines
INTERESTING_CONTIG = "my_contig"
INTERESTING_INDEX = super_bam.references.index(INTERESTING_CONTIG)

# Copy the header but truncate the existing SQ lines
header = super_bam.header.copy()
header["SQ"] = []

# Keep a set of required SQ lines indices
# and add the SQ index of the target contig
required_indices = set()
required_indices.add(INTERESTING_INDEX)

# Parse the reads mapped to the target contig and add
# the index of the contig the mate pair appears on
# (skipping unmapped mates, which have an index of -1)
for read in super_bam.fetch(INTERESTING_CONTIG):
    if read.next_reference_id >= 0:
        required_indices.add(read.next_reference_id)

# Populate the new header's SQ lines, extracting the
# original SQ data from the super_bam.header for each
# index harvested from the previous step
for ref_index in required_indices:
    header["SQ"].append( super_bam.header["SQ"][ref_index] )

# Open a new BAM for writing, with your new header
contig_bam = pysam.AlignmentFile("my_new.bam", "wb", header=header)

# Write the reads out untouched: their reference_id and
# next_reference_id still point at positions in the old,
# 730K-entry @SQ list (this is the mistake)
for read in super_bam.fetch(INTERESTING_CONTIG):
    contig_bam.write(read)

contig_bam.close()
super_bam.close()

This however displeased SAMtools greatly:

$ samtools index NODE_912989_length_238_cov_5.743698.with_non-seq_header.bam 
Segmentation fault (core dumped)

$ samtools view -H NODE_912989_length_238_cov_5.743698.with_non-seq_header.bam 
@HD     VN:1.0  SO:coordinate
@SQ     SN:NODE_539672_length_126_cov_5.206349  LN:176
@SQ     SN:NODE_837244_length_378_cov_5.251323  LN:428
@SQ     SN:NODE_912989_length_238_cov_5.743698  LN:288
@SQ     SN:NODE_1101582_length_123_cov_4.081301 LN:173
@SQ     SN:NODE_1140726_length_2383_cov_9.494335        LN:2433
@RG     ID:Limpet-Magda SM:Limpet-Magda PU:Illumina     PL:Illumina
@PG     PN:bowtie2      ID:bowtie2      VN:2.2.3        CL:"[...]"

$ samtools view NODE_912989_length_238_cov_5.743698.with_non-seq_header.bam 
[main_samview] truncated file.

As I'd merely re-written the header, keeping each read's reference_id and next_reference_id intact, I'd inadvertently created an invalid BAM file which causes samtools to seg fault when trying to parse it with samtools view or samtools index. Without getting too technical, samtools expects the length of the list of @SQ lines to equal the index of the largest @SQ line, i.e. the @SQ lines are consecutively numbered [1]. Values for both the reference_id and next_reference_id for each read are used by samtools not to refer to the @SQ line with some ID i, but rather the i'th @SQ line in the list of sequences. This is an important distinction: having filtered out the majority of sequence lines (the example above contains just 5 of the ~730K original @SQ lines in the superset BAM), I had disturbed the numbering scheme; worse still, I'd made it almost certain that an error would occur when trying to read any file created in the same way.

In the above example, the contig of interest is NODE_912989_length_238_cov_5.743698, whose corresponding reads have a reference_id of 421586. This is not the @SQ line with ID 421586, but the 421586'th sequence in the list of all @SQ lines. Yet in the subset BAM, samtools expects that integer to index into the new, five-entry list of sequences it builds during parsing. Later, when attempting to output information on the reads contained in the file, the reference_id of 421586 causes samtools to attempt to access invalid memory: the 421586'th element of a struct with only 5 entries.

samtools elegantly handles my stupidity by segfaulting.
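
In pysam terms, the positional relationship looks like this (a quick illustration reusing the placeholder path from earlier; 421586 is the value from my super BAM):

import pysam

bam = pysam.AlignmentFile("/path/to/my.bam")
read = next(bam.fetch("NODE_912989_length_238_cov_5.743698"))

print(read.reference_id)                          # 421586: just a position...
print(bam.get_reference_name(read.reference_id))  # ...resolved against this file's @SQ list

# Write that same read under a header with only five @SQ lines and the
# integer now points past the end of the list, hence the segfault.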

Unordered sequence header causes huge GATK errors

Hacking on a hack, I simply re-numbered the reference_id and next_reference_id attributes of the appropriate reads with consecutive integers to match their new @SQ lines. I appended the target contig to the sequence header first and translated the IDs of its reads to 0, as I had done earlier. When unseen contigs carrying mates of target-contig reads were encountered, the contig was also appended to the new header and the next_reference_id was overwritten with a new incremental ID:

import pysam

# Load superset BAM
super_bam = pysam.AlignmentFile("/path/to/my.bam")

# Construct an index reference
reference_index = {}
for i, reference in enumerate(super_bam.references):
    reference_index[reference] = i

# Define target contig and fetch its index in the SQ lines
INTERESTING_CONTIG = "my_contig"
INTERESTING_INDEX = reference_index[INTERESTING_CONTIG]

# Copy the header but truncate the existing SQ lines
header = super_bam.header.copy()
header["SQ"] = []

# Maintain a map of sequence index translations
#  The key maps the old @SQ line index to the new
#  @SQ line index value for the subset header
header_sq_map = {}

# Unmapped mates (-1) should stay unmapped after translation
header_sq_map[-1] = -1

# As before, keep a set of required SQ lines indices
# We can append the target contig to the header
# and also pre-emptively translate it to the 0th SQ
required_indices = set()
header["SQ"].append( super_bam.header["SQ"][INTERESTING_INDEX] )
header_sq_map[INTERESTING_INDEX] = 0

# Parse the reads mapped to the target contig and add
# the index of the contig the mate pair appears on
# (if not the target contig, or unmapped)
for read in super_bam.fetch(INTERESTING_CONTIG):
    if read.next_reference_id >= 0 and read.next_reference_id != INTERESTING_INDEX:
        required_indices.add(read.next_reference_id)

# Populate the new header's SQ lines, extracting the
# original SQ data from the super_bam.header for each
# index harvested from the previous step
# This time, we also add an entry in header_sq_map for
# translation on our second loop over the reads
# (offset by one, as the target contig already occupies slot 0)
for i, ref_index in enumerate(required_indices):
    header["SQ"].append( super_bam.header["SQ"][ref_index] )
    header_sq_map[ref_index] = i + 1

# Open a new BAM for writing, with your new header
contig_bam = pysam.AlignmentFile("my_new.bam", "wb", header=header)

# Fetch, translate and write reads on the target contig to new BAM
for read in super_bam.fetch(INTERESTING_CONTIG):
    read.reference_id = header_sq_map[read.reference_id]
    read.next_reference_id = header_sq_map[read.next_reference_id]
    contig_bam.write(read)

contig_bam.close()
super_bam.close()

This didn't appear to break samtools as before, and after a quick trip through Picard's MarkDuplicates, I packed off 250 subset BAMs on an adventure through the GATK best practice pipeline. The trip was abruptly cut short and I was left with a directory containing over 5GB of error logs:

Input files reads and reference have incompatible contigs: The contig order in reads and reference is not the same; to fix this please see: (https://www.broadinstitute.org/gatk/guide/article?id=1328),  which describes reordering contigs in BAM and VCF files..

The error helpfully went on to list each of the contigs in the current BAM, along with all ~730K contigs found in the reference FASTA, by name. It appeared GATK did not approve of my somewhat haphazard appending-as-first-encountered approach to the reads-with-mates-not-on-target problem. It appears that the order of the @SQ lines must match that of the appearance of the contigs themselves in the reference FASTA. Presumably this is to ensure quick and easy mapping between the @SQ lines in the BAM and the entries of the reference FASTA index and dictionary.

ReorderSam assumes the header is supposed to contain all contigs found in the reference

As appears to be the norm, the GATK error text helpfully links a how-to article that may be of use and notes that the Picard toolkit offers a handy ReorderSam command that is capable of sorting @SQ lines to match the order in which contigs appear in a given reference FASTA, updating the reference IDs of reads and their mates as appropriate. Once again, invocation was simple:

java -jar picard.jar ReorderSam I=<subset.bam> R=<ref.fa> O=<subset.sq_reordered.bam>

But in bioinformatics, simple problems rarely have simple solutions [2], and ReorderSam had basically reinstated the original superset BAM header:

$ samtools view -H test.bam | grep -c "^@SQ"
49
$ samtools view -H test.sq_reordered.bam | grep -c "^@SQ"
730724

Whilst ReorderSam does indeed perform some re-ordering, I feel it is somewhat of a misnomer and perhaps ReconcileSamRef [3] would be a more fitting name for the tool. Evidently the tool is primarily used under the assumption that both the input BAM and reference have the same set of contigs, where one may be ordered lexicographically and the other by karyotype. Unfortunately, neither of the two boolean options that can be specified to ReorderSam had the functionality I needed, though one (ALLOW_INCOMPLETE_DICT_CONCORDANCE=True) performed the exact opposite: dropping reads from the source BAM if their contig did not appear in the reference. But we'll return to this option.

Sorted but not solved: GATK IndelRealigner upset by @SQ lines not matching reference FASTA after all

We can solve the out of order problem rather trivially:

[...]

required_indices = set()
required_indices.add(INTERESTING_INDEX)

# As before, parse the reads mapped to the target contig and
# add the index of the contig the mate pair appears on
# (skipping unmapped mates, which have an index of -1)
for read in super_bam.fetch(INTERESTING_CONTIG):
    if read.next_reference_id >= 0:
        required_indices.add(read.next_reference_id)

# Populate the new header's SQ lines, extracting the
# original SQ data from the super_bam.header for each
# index harvested from the previous step
# This time, we also add an entry in header_sq_map for
# translation on our second loop over the reads
#   Note the sort applied to the required_indices set
#   that solves the issue with @SQ lines appearing
#   out of order against the reference FASTA
for i, ref_index in enumerate(sorted(required_indices)):
    header["SQ"].append( super_bam.header["SQ"][ref_index] )
    header_sq_map[ref_index] = i

[...]

As we've been collecting the indices (that is, the entry at which a contig's name appears in the @SQ header) all along, we can just use Python's sorted built-in on the required_indices set. This ensures that the entries of the header_sq_map dictionary (later used to reassign the reference_id and next_reference_id attributes of reads) are created with incrementally assigned values, in the order that those sequences should appear in the finished header. Tada!

One might have expected that to be the end of things, but whilst these files are now somewhat valid (and can be processed by Picard’s MarkDuplicates and GATK’s RealignerTargetCreator tools), they will typically fail a trip through Picard’s ValidateSamFile. This is a side-effect of our only interest being in reads that appear on the target contig. Although we retain information about mate pairs, including those on other contigs (whose sequence headers are also now included in the @SQ header), we discard the mates themselves along with any other read that does not fall on the specific contig that we target. Indeed, ValidateSamFile raises errors for each of these missing mates:

ERROR: Read name <RNAME>, Mate not found for paired read

Oddly, when I attempted to check whether these reads would fall foul of the infamous MalformedReadFilter with PrintReads (don't forget, PrintReads automatically applies the MalformedReadFilter), a completely different error surfaced [4]:

Badly formed genome loc: Parameters to GenomeLocParser are incorrect: The contig index <x> is bad, doesn’t equal the contig index <y> of the contig from a string <contig>

Busted. Although I’ve successfully ordered the subset of contigs to reflect the order in which they appear in the reference, appeasing both MarkDuplicates and RealignerTargetCreator, there’s no pulling the wool over the eyes of PrintReads. But as it turns out, PrintReads isn’t the only tool in the kit that is capable of seeing through our fraudulent activity. Given that RealignerTargetCreator completes successfully, one would naturally run the next step in the best practice pipeline: the IndelRealigner, which gives exactly the same error.

Same error, different walker: GATK HaplotypeCaller also upset by @SQ lines not matching reference FASTA verbatim

So what if we were naughty? What if we just want this saga to be over and decide to throw best practice to the wind? We could just skip indel realignment entirely and jump straight to haplotype calling, right? Sadly, GATK has you cornered now. Invoking the HaplotypeCaller with a file treated with our tool yields yet another error:

Rod span <contig>:<pos> isn’t contained within the data shard <contig>:<pos>, meaning we wouldn’t get all of the data we need

On the surface, this error doesn't appear to give much away. The contig and position that could not be found are repeated twice in the message, but I guess the confusion comes from the jargon, and the error boils down to something pretty simple:

Hey, I looked for contig:pos where contig:pos should be according to the reference and I could not find them, so I don’t have the data I need to do the stuff you told me to do. So, I’m going now. Bye.

Yeah, it's HaplotypeCaller telling us the same thing as PrintReads and IndelRealigner. Nobody wants our shoddily manufactured BAM file: it violates an underlying assumption that every @SQ line in the header should appear consecutively, in the same order as they do in the reference FASTA (and by extension, the reference dictionary and index). Despite our best attempt to renumber the reference_id and next_reference_id attributes of the reads themselves to match a new ordering of just a subset of those @SQ lines, there appears to be no getting around this implicit requirement that the header and reference map 1:1.

I guess this is for the same reason that GATK requires a .dict and .fai file for references, as I've discussed before: it just makes things a little easier for developers (and their code). In this case, the assumption that each contig reference has a bijective mapping between the BAM header, reference index and reference dictionary means that lookups can simply rely on contig indices: i.e. the i'th @SQ line will also be the i'th entry of the reference index and dictionary.

So, this has been a great exercise in learning more about the BAM specification, pysam and the excessively orderly nature of the GATK, but how are we supposed to correctly subset a BAM? Surely there must be an easier way than all of this?

I downed tools, and did what I should have done much earlier: I read the manual.

How to correctly subset a BAM for analysis

Who’d have thought, wanting to perform analysis on subsets of BAMs is actually quite a common use case that the lovely GATK folks have already considered? It turns out that “subsetting” was perhaps not the keyword to be looking for, but rather “intervals”. In fact a simple search immediately yields a helpful GATK article on when to use interval lists for processing and the GATK command line reference describes the -L or --intervals argument that is accepted by many of the tools to support performing operations on specific parts (or intervals) of the BAM. The -L argument even crops up in the very same pre-processing best practice documents that I was purportedly following for indel realignment:

[RealignerTargetCreator will be] Faster since RTC will only look for regions that need to be realigned within the input interval; no time wasted on the rest.

Sure enough, I can just append the -L argument with the name of my target contig (as it appears in the @SQ header and reference) as a parameter to many of the tools provided by GATK. -L can also be specified multiple times, or even just reference a text file of intervals, too:

java -jar GenomeAnalysisTK.jar -T RealignerTargetCreator \
                               -R <REFERENCE FASTA> \
                               -I <INPUT BAM> \
                               -o <OUTPUT LIST> \
                               -L <CONTIG_1> [-L <CONTIG_2> ... -L <CONTIG_N>]

Re-running my example from earlier, specifying -L NODE_912989_length_238_cov_5.743698 causes RealignerTargetCreator to run in a matter of minutes instead of almost two days (the traversal itself completes in less than a second according to the log below), with an input BAM of over 30GB. I should add that this handy option doesn't seem to decrease the amount of memory required, as the re-run still munched on 32.4GB of RAM, but I guess that's little to worry about if the job completes in less than five minutes:

INFO  01:02:57,559 SAMDataSource$SAMReaders - Initializing SAMRecords in serial 
INFO  01:04:24,346 SAMDataSource$SAMReaders - Done initializing BAM readers: total time 86.76 
INFO  01:04:24,464 IntervalUtils - Processing 288 bp from intervals 
INFO  01:04:28,548 GenomeAnalysisEngine - Preparing for traversal over 1 BAM files 
INFO  01:04:28,801 GenomeAnalysisEngine - Done preparing for traversal 
INFO  01:04:28,803 ProgressMeter - [INITIALIZATION COMPLETE; STARTING PROCESSING] 
INFO  01:04:28,804 ProgressMeter -                 | processed |    time |    per 1M |           |   total | remaining 
INFO  01:04:28,805 ProgressMeter -        Location |     sites | elapsed |     sites | completed | runtime |   runtime 
INFO  01:04:29,046 ProgressMeter -            done       288.0     0.0 s      14.1 m       69.4%     0.0 s       0.0 s 
INFO  01:04:29,049 ProgressMeter - Total runtime 0.25 secs, 0.00 min, 0.00 hours

Excellent, I’m sure both my supervisor and system administrator will be pleased.

What about extraction?

It’s all well and good that we can concentrate processing on interesting contigs like this, but what if we reeeaally want to extract and store some reads for a specific contig like we have been trying, can we do it?

Sadly, it seems we’re fresh out of luck. We can abuse PrintReads to parse and write a new BAM, appending the -L argument to our command which will have the side effect of dropping reads that don’t fall on the contig(s) specified. As one would have expected, the output BAM is significantly smaller and demonstrates the correct number of reads (or at least, the read count matches that in the BAM we made for ourselves), so what’s the problem?

$ java -jar GenomeAnalysisTK.jar -T PrintReads \
                                 -R contigs.fa \
                                 -I Limpet-Magda.sorted.bam \
                                 -o NODE_912989_length_238_cov_5.743698.subset.print.bam \
                                 -L NODE_912989_length_238_cov_5.743698

$ samtools view -c NODE_912989_length_238_cov_5.743698.subset.print.bam
98

$ samtools view -H NODE_912989_length_238_cov_5.743698.subset.print.bam | grep -c "^@SQ"
730724


We're back where we started with my own tool: a BAM with the right number of reads but a fully intact header that causes wasted resources. We've reached the crux. Unsurprisingly, for the reasons I've hypothesised, we just can't be messing around with the @SQ header if we still want to use the same reference as that used with the super BAM. I briefly toyed with the thought of generating subsets of the reference FASTA itself, to match the new @SQ of each subset BAM. This would definitely appease the tools upset by our trickery, but we'd need to also generate FASTA indexes and dictionaries for each new reference, and ensure we provide the right sub-reference for each sub-BAM when conducting analysis later. My bioinformaticy senses tingled: this sounds messy, a sticky plaster over a sticky plaster. I could already see another addendum to a long future blog post forming.


For the time being, I'd achieved what I needed to do, at least in part. I've discovered how to focus efforts on specific intervals of interest with the -L argument, saving computational resources along with my own time and sanity. I can now get on with following the GATK best practice pipeline, and if I do encounter a use case that necessitates extraction of reads in the sense of what I initially set out to do, I can spin out a tool to just regenerate a new reference FASTA, dictionary and index, as messy as that may sound.


Though, before you leave here with the conclusion that I can't even read, I should perhaps leap to my own defence a little. The reason that I didn't just set out to operate on a subset (interval!) of the alignment was to avoid having to define subregions at every step of the analysis pipeline. Although primarily out of laziness, the idea was to also avoid having to store all of the reads that weren't of interest to me in the first place; don't forget, for our cluster, disk space is as much of a scarce commodity as RAM. I also wanted small BAMs (10-100Mb) that could be effortlessly transmitted to others without worrying about bandwidth, hosting or having to offer aftercare to people trapped underneath 365 million reads. Really, I just wanted to quickly and crudely look at some data for myself and I thought it would be easy to roll something small to do the trick with pysam.

But I learned my lesson.

Update: The following evening


For what it's worth, the GATK developers got in touch and shared an article describing how they generate example files containing a subset of reads for a workshop. The tutorial suggests that, as per my earlier suspicions, the best way to achieve extraction is to build a subset reference too. Interestingly, extraction and indexing of single contigs from a FASTA can both be done with samtools faidx, which I didn't realise. The process overall is a little convoluted: for example, the BAM header must be extracted (samtools view -H) and manually edited to prune @SQ lines, and the BAM must then be converted to SAM so that Picard ReplaceSamHeader can apply the modified header (and converted back again afterwards). As with my own example earlier, this process will still leave reads without a mate if the mate appears on a contig which has been filtered out. However, the tutorial does offer a solution to this in the form of Picard's RevertSam tool, whose (albeit quite destructive) SANITIZE option will forcibly discard reads that cause the SAM to be invalid.

Update: If you aren’t lazy

If you're happy to make a new reference and just want to extract a bunch of reads for a particular contig, you're in luck. You can extract the contig with samtools faidx and use ReorderSam's ALLOW_INCOMPLETE_DICT_CONCORDANCE option (S=true shorthand) to forcibly drop reads whose contigs don't appear in your new reference. Ta-da!

samtools faidx ref.fa my_contig_name > my_contig.fa
java -jar picard.jar CreateSequenceDictionary R=my_contig.fa O=my_contig.dict
java -jar picard.jar ReorderSam INPUT=super.bam OUTPUT=subset.bam REFERENCE=my_contig.fa S=true VERBOSITY=WARNING

Set the verbosity to WARNING to avoid thousands of INFO lines telling you about dropped contigs, and don’t forget to make your sequence dictionary first! Happy subsetting!

tl;dr

  • Although ReorderSam does perform re-ordering, its name does not communicate its assumption that both the input BAM and reference FASTA share the same set of contigs, that just happen to be ordered differently
  • You simply cannot chop out swathes of the @SQ header, no matter how well you cover up your tracks
  • GATK insists you stop mucking about with BAM headers; consider them contaminated as soon as you touch them with your careless fingers
  • Use the -L parameter to use a GATK tool on a subset of reads in a large BAM
  • samtools faidx can also extract a contig from a FASTA
  • Picard RevertSam‘s SANITIZE option can be used to discard reads missing mates (amongst many other things)
  • Seriously, stop trying to do weird things with BAMs by yourself
  • But you could use ReorderSam with S=true if you are happy to make a new reference.

$ time java -Xmx3G -jar ~/ware/GenomeAnalysisTK.jar -T PrintReads -I NODE_912989_length_238_cov_5.743698.good.bam -R contigs.fa
[...]
118.96user 2.49system 1:21.35elapsed 149%CPU (0avgtext+0avgdata 3172152maxresident)k


  1. Although this is somewhat of a tautology (the list's length is expected to equal the greatest ID in the list, where the "ID" of a sequence directly corresponds to that sequence's position in the list of all @SQ lines), hopefully it helps to explain why removing elements from that list was silly. 
  2. This is especially true when you are doing it wrong™
  3. Narrowly beating my other suggestion of HarmonizeSamHeader
  4. Incidentally, for an 8.1Kb BAM and a 441Mb contig FASTA, I needed to set the Java virtual machine heap to 3GB. For the record, I ran the command with time (output shown above these footnotes); note the seemingly insane use of 3172Mb of resident memory for what seems to be a trivial job on the surface. 

Duplicate definition error with GATK PrintReads and MalformedReadFilter
Thu, 07 Jan 2016 | https://samnicholls.net/2016/01/07/gatk-printreads-malformedreadfilter/

This afternoon I wanted to quickly check [1] whether some reads in a BAM would be filtered out by the GATK MalformedReadFilter. As you can't invoke the filter alone, I figured one of the quickest ways to do this would be to utilise GATK PrintReads, which pretty much parses and spits out input BAMs, while also allowing one to specify filters and the like to be applied to the parser as it dutifully goes about its job of taking up all your cluster's memory. I entered the command, taking care to specify MalformedRead for the -rf read filter option, feeling particularly pleased with myself for finally being capable of using a GATK command from memory:

java -jar GenomeAnalysisTK.jar -T PrintReads -rf MalformedRead -I <INPUT> -R <REFERENCE>

GATK, wanting to teach me a lesson for not consulting documentation, quickly dumped a stack trace to my terminal and wiped the smile off my face.

##### ERROR ------------------------------------------------------------------------------------------
##### ERROR stack trace 
org.broadinstitute.gatk.utils.exceptions.ReviewedGATKException: Duplicate definition of argument with full name: filter_reads_with_N_cigar
        at org.broadinstitute.gatk.utils.commandline.ArgumentDefinitions.add(ArgumentDefinitions.java:59)
        at org.broadinstitute.gatk.utils.commandline.ParsingEngine.addArgumentSource(ParsingEngine.java:150)
        at org.broadinstitute.gatk.utils.commandline.CommandLineProgram.start(CommandLineProgram.java:207)
        at org.broadinstitute.gatk.utils.commandline.CommandLineProgram.start(CommandLineProgram.java:155)
        at org.broadinstitute.gatk.engine.CommandLineGATK.main(CommandLineGATK.java:106)
##### ERROR ------------------------------------------------------------------------------------------
##### ERROR A GATK RUNTIME ERROR has occurred (version 3.5-0-g36282e4):
##### ERROR
##### ERROR This might be a bug. Please check the documentation guide to see if this is a known problem.
##### ERROR If not, please post the error message, with stack trace, to the GATK forum.
##### ERROR Visit our website and forum for extensive documentation and answers to 
##### ERROR commonly asked questions http://www.broadinstitute.org/gatk
##### ERROR
##### ERROR MESSAGE: Duplicate definition of argument with full name: filter_reads_with_N_cigar
##### ERROR ------------------------------------------------------------------------------------------

At this point I felt somewhat hopeless: I was actually trying to use the MalformedReadFilter to debug something else, and now I was stuck two errors deep, surrounded by more Java than I could stomach. Before having a full breakdown about whether bioinformatics really is broken, I remembered I am a little familiar with the filter in question. Indeed, I recognised the filter_reads_with_N_cigar argument from the error as one that can be supplied to the MalformedReadFilter itself. This seemed a little odd: where could it be getting a duplicate definition from?

Of course, from my own blog post and the PrintReads manual page, I should have recalled that the MalformedReadFilter is automatically applied by PrintReads. Specifying the same filter on top with -rf apparently causes somewhat of a parsing upset. So there you have it: if you want to check whether your reads will be discarded by the MalformedReadFilter, you can just use PrintReads on its own:

java -jar GenomeAnalysisTK.jar -T PrintReads -I <INPUT> -R <REFERENCE>

tl;dr

  • GATK PrintReads applies the MalformedReadFilter automatically
  • Specifying -rf MalformedRead to PrintReads is not only redundant but problematic
  • Always read the fucking manual
  • Read your own damn blog
  • GATK is unforgiving

  1. It’s about time I realised that in bioinformatics, nobody has ever successfully “quickly checked” anything. 
Grokking GATK: Common Pitfalls with the Genome Analysis Tool Kit (and Picard) https://samnicholls.net/2015/11/11/grokking-gatk/ https://samnicholls.net/2015/11/11/grokking-gatk/#comments Wed, 11 Nov 2015 16:11:50 +0000 https://samnicholls.net/?p=336 The Genome Analysis Tool Kit (“the” GATK) is a big part of our pipeline here. Recently I’ve been following the DNASeq Best Practice Pipeline for my limpet sequence data. Here are some of the mistakes I made and how I made them go away.

Input file extension pedanticism

Invalid command line: The GATK reads argument (-I, --input_file) supports only BAM/CRAM files with the .bam/.cram extension

Starting small, this was a simple oversight on my part: my naming script had made a mistake, but I knew the files were BAM, so I ignored the issue and continued with the pipeline anyway. GATK, however, was not impressed and aborted immediately. A minor annoyance (the error even acknowledges the input appears to be BAM) but a trivial fix.
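If a naming script is to blame, the least painful fix is usually to just rename the offenders rather than argue with GATK; a throwaway sketch, assuming a made-up .aln extension:

# Rename mis-named alignments so GATK accepts them (".aln" is purely hypothetical)
for f in *.aln; do mv "$f" "${f%.aln}.bam"; done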

A sequence dictionary (and index) is compulsory for use of a FASTA reference

Fasta dict file <ref>.dict for reference <ref>.fa does not exist. Please see http://gatkforums.broadinstitute.org/discussion/1601/how-can-i-prepare-a-fasta-file-to-use-as-reference for help creating it.

Unmentioned in the documentation for the RealignerTargetCreator tool I was using, a sequence dictionary for the reference FASTA must be built and present in the same directory. The error kindly refers you to a help article on how one can achieve this with Picard and indeed, the process is simple:

java -jar ~/git/picard-tools-1.138/picard.jar CreateSequenceDictionary R=<ref>.fa O=<ref>.dict

Though, I am somewhat confused as to what exactly a .dict file provides GATK over a FASTA index .fai (which is also required). Both files include the name and length of each contig in the reference, but the FASTA index also includes positional information vital to enabling fast random access. The only additional information in the SAM-header-like sequence dictionary appears to be an MD5 hash of each sequence, which doesn’t seem overly useful in this scenario. I guess the .dict adds a layer of protection if GATK uses the hash as a sanity check, ensuring the loaded reference matches the one for which the index and dictionary were constructed.
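For completeness, the accompanying .fai half of the requirement is a one-liner with samtools (assuming samtools is on your PATH), alongside the CreateSequenceDictionary command above:

samtools faidx <ref>.fa    # writes <ref>.fa.fai next to the reference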

You forgot to index your intermediate BAM

Invalid command line: Cannot process the provided BAM file(s) because they were not indexed. The GATK does offer limited processing of unindexed BAMs in --unsafe mode, but this GATK feature is currently unsupported.

Another frequently occurring issue caused by user forgetfulness. Following the best practice pipeline, one generates many “intermediate” BAMs; each of these must be indexed for efficient use during the following step, otherwise GATK will be disappointed with your lack of attention to detail and refuse to do any work for you.

Edit (13 Nov): A helpful reddit comment from a Picard contributor recommended setting CREATE_INDEX=true when using Picard to automatically create an index of your newly output BAM. Handy!
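For anyone who, like me, forgets: indexing an intermediate BAM is a one-liner, or can be rolled into the Picard step that produced it (file names invented):

samtools index dedup.bam    # writes dedup.bam.bai
# or have Picard do it while writing the BAM in the first place:
java -jar picard.jar MarkDuplicates I=sorted.bam O=dedup.bam M=dedup_metrics.txt CREATE_INDEX=true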

Your temporary directory is probably too small

Unable to create a temporary BAM schedule file. Please make sure Java can write to the default temp directory or use -Djava.io.tmpdir= to instruct it to use a different temp directory instead.

GATK appears to love creating hundreds of thousands of small bamschedule.* files, which, according to a glance at some relevant-looking GATK source, appear to handle multithreaded merging of large BAM files. So numerous are these files that their presence totalled my limited temporary space. This was especially frustrating given the job had run for several hours, blissfully unaware that there are only so many things you can store in a shoebox. To avoid such disaster, inform Java of a more suitable location to store junk:

java -Djava.io.tmpdir=/not/a/shoebox/ -jar <jar> <tool> ...

On rare occasions, you may encounter permission errors when writing to the default temporary directory. Specifying java.io.tmpdir as above will free you of these woes too.

You may have too many files and not enough file handles

Picard and GATK try to store some number of reads (or other plentiful metadata) in RAM during the parsing and handling of BAM files. When this limit is exceeded, reads are spilled to disk. Both Picard and GATK appear to keep file handles for these spilled reads open simultaneously, presumably for fast access. But your executing user is likely limited to carrying only so many handles before becoming over-encumbered and falling to the ground, throwing an exception being the only option left:

Exception in thread "main" htsjdk.samtools.SAMException: […].tmp not found
[…]
Caused by: java.io.FileNotFoundException: […].tmp (Too many open files)

In my case, I encountered this error when using Picard MarkDuplicates, which has a default maximum number of file handles1. This ceiling happened to be higher than that of the system itself. The fix in this case is trivial: use ulimit -n to determine the number of files your system will permit you to have a handle on at once, and inform MarkDuplicates using the MAX_FILE_HANDLES_FOR_READ_ENDS_MAP parameter:

$ ulimit -n
1024

$ java -jar picard.jar MarkDuplicates MAX_FILE_HANDLES_FOR_READ_ENDS_MAP=1000 ...

This is somewhat counter-intuitive: the error is caused by an acute overabundance of file handles, yet my suggested fix is to permit even fewer handles? In this case at least, it appears Picard compensates by creating fewer, larger spill files. You’ll notice I didn’t use the exact value of ulimit -n in the argument; it’s likely there’ll be a few other file handles open here and there (your input, output and metrics file, at least), so if you use the full limit you’ll stumble across the same error once more.

From a little search, it appears that for the most part GATK will open as many files as it wants, and if that number is greater than ulimit -n, it will throw a tantrum. Unfortunately, you may be out of luck here for solving the problem on your own: non-administrative users cannot raise the hard limit on open file handles themselves, so you’ll need to befriend your system administrator and kindly request that it be raised before continuing. Though, the same link does suggest that lowering the number of GATK execution threads can potentially alleviate the issue in some cases.
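That said, it’s worth checking both limits before bothering anyone; on many systems the soft limit sits below the hard limit and can be raised without any special privileges:

ulimit -Sn         # soft limit: what your processes currently get
ulimit -Hn         # hard limit: the ceiling only an administrator can raise
ulimit -Sn 4096    # raise the soft limit yourself, up to (but not beyond) the hard limit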

Your maximum Java heap is also too small

There was a failure because you did not provide enough memory to run this program.  See the -Xmx JVM argument to adjust the maximum heap size provided to Java

GATK has an eating problem: it has no self-restraint when memory is on the table. I’m not sure whether GATK was brought up with many siblings that had to fight for food, but it certainly doesn’t help that it is implemented in Java, a language not particularly known for its memory efficiency. When invoked, Java will allocate a heap on which to pile the many objects it wants to keep around, with a typical maximum size of around 1GB. It’s not enough to just specify to your job scheduler that you need all of the RAM; you also need to let Java know that it is welcome to expand the heap beyond the default maximum for dumping genomes into. Luckily this is quite simple:

java -Xmx<int>G -jar <jar> <tool> ...

The MalformedReadFilter has a looser definition of malformed than expected

I’ve previously touched on the discovery that the GATK MalformedReadFilter is much more aggressive than its documentation lets on. The lovely GATK developers have even opened an issue about it after I reported it in their forum.


tl;dr

  • Your BAM files should end in .bam
  • Any FASTA based reference needs both an index (.fai) and dictionary (.dict)
  • Be indexing, always
  • pysam is a pretty nice package for dealing with SAM/BAM files in Python
  • Your temp dir is too small, specify -Djava.io.tmpdir=/path/to/big/disk/ to java when invoking GATK
  • Picard may generously overestimate the number of file handles available
  • GATK is a spoilt child and will have as many file handles as it wants
  • Apply more memory to GATK with java -Xmx<int>G to avoid running out of heap
  • Remember, the MalformedReadFilter is rather aggressive
  • You need a bigger computer

  1. At the time of writing, 8000. 
`memblame` https://samnicholls.net/2015/04/26/memblame/ https://samnicholls.net/2015/04/26/memblame/#respond Sun, 26 Apr 2015 10:06:36 +0000 http://blog.ironowl.io/?p=258 As a curious and nosy individual who likes to know everything, I wrote a script dubbed memblame which is responsible for naming and shaming authors of “inefficient”1 jobs at our cluster here in IBERS.

It takes patience, often days of it, sometimes longer, to see large-input jobs executed on a node of the compute cluster here. Typically this is down to the amount of RAM requested: only a handful of nodes are actually capable of scheduling jobs that have a RAM quota of 250GB or larger. But these nodes are often busy with other tasks too.

One dreary afternoon while waiting a particularly long time for an assembly to pop off the queue and begin, I started to wonder what the hold up was.

Our cluster is underpinned by Sun Grid Engine (SGE), a piece of software entrusted with the scheduling and management of submitted jobs, and one on which I have formed a strong opinion over the past few months2. When a job completes (regardless of exit status), SGE stores the associated job metadata as plain text in an “accounting” logfile on the cluster’s root node.

The file appeared trivially parseable3 and offered numerous fields for every job submitted to the node since its last boot4. Primed for procrastination with mischief and curiosity, I knocked up a Python-based parser and delivered memblame.
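Incidentally, if you only care about one finished job rather than the whole history, SGE’s qacct (where installed) will read the accounting file for you; the job ID here is made up:

qacct -j 1234567 | grep -E 'maxvmem|ru_wallclock|exit_status'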

The script dumps out a table detailing each job with the following fields as columns:

Field                  Description
jid                    SGE Job ID
node                   Hostname of Execution Node
name                   Name of Job Script
user                   Username of Author
gbmem_req              GB RAM Requested
gbmem_used             GB RAM Used
delta_gbmem            ΔGB RAM (Requested − Used)
pct_mem                %GB Requested RAM Utilised
time                   Execution Duration
gigaram_hours          GB RAM Used × Execution Hours
wasted_gigaram_hours   GB RAM Unused × Execution Hours
exit                   Exit Status (0 if success)

The table introduces the concept of wasted_gigaram_hours, defined as the number of gigabytes of RAM unused (where RAM “used” is taken to be the peak RAM usage measured by the scheduler over the duration of the job5, and unused is therefore the difference between RAM requested and RAM utilised; delta_gbmem) multiplied by the number of hours the job ran for. Thus a job that over-requests 1GB of RAM and runs for a day “wastes” 24 GB Hours!
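As a throwaway worked example (numbers entirely invented), the arithmetic really is that blunt:

req_gb=250; used_gb=120; hours=36
echo "wasted_gigaram_hours: $(( (req_gb - used_gb) * hours ))"    # (250 - 120) * 36 = 4680 GB Hours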

I created this additional field in an attempt to more fairly compare different classes of job that take vastly different execution times to complete; i.e. jobs that use (and over-request) large amounts of RAM for only a short time should not necessarily be shamed more than smaller jobs that over-request less RAM for a much longer period of time.

Incidentally, at the time of publishing the 1st Monthly MemBlame Leaderboard, no matter which field was used to order the rankings, a member of our team who shall remain nameless6 won the gold medal for wastage.

Though it wasn’t necessarily the top of the list that was interesting. Although naming and shaming those responsible for ridiculous RAM wastage (~0.76 TB per day over 11 days6) on an assembly job that didn’t even complete successfully6 is fun in jest, memblame also revealed user behaviours such as a tendency to request the default amount of RAM for small jobs such as BLASTing (up to ~5x more RAM than necessary), which easily tied up resources on smaller nodes when many of these jobs ran in parallel. In the long run I’d like to use this sort of data to improve guess-timates on resource requests for large and long-running jobs, in an attempt to reduce resource hogging for significant periods of time when completing big assemblies and alignments.

I should add that “wasted RAM” is just one of the many dimensions we could look at when discussing job “efficiency”7. I chose to look at RAM underuse for this particular situation because, in my opinion, it appears to be the weakest resource in the setup that we have, and the one whose usage users seem to struggle the most to estimate.

If nothing else, it promotes a healthy discussion about the efficiency of the tools that we are using, and provides the opportunity to poke some light-hearted fun at people who lock up 375GB of RAM over the course of two hours running a poorly parameterised sort8.


tl;dr

  • I wrote a script to name and shame people who asked for more RAM than they needed.

  1. Although properly determining a metric to fairly represent efficiency is a task in itself. 
  2. I’m also writing software with the sole purpose of abstracting away having to deal with SGE entirely. 
  3. In fact the hardest part was digging around to locate a manual to actually decipher what each field represented and how to translate them to something human readable. 
  4. Which seems to be correlated with the date of Aberystwyth’s last storm. 
  5. It’s likely that jobs are even less “efficient” than reported by memblame, as scripts probably don’t utilise memory uniformly over a job’s lifetime. Unfortunately max_vmem is the only metric for RAM utilisation that can be extracted from SGE’s accounting file. 
  6. I’m sorry, Tom. 
  7. Although properly determining a metric to fairly represent efficiency is a task in itself. 
  8. That one was me. 