I wanted a BAM that contained reads aligned to just one of the many contigs the file contained. As usual, I made this much more difficult than it really ought to have been.

This post takes a little look at manually handling BAM files with pysam and perhaps why it was not a good idea for the use case in question. For those who really just want to subset a BAM without the lesson, skip ahead, or consult some appropriate documentation.

Wasting time with `RealignerTargetCreator`, large SQ headers and sparse BAMs

I began by first pulling out the reads associated with a specific contig of interest and writing them to a new BAM with pysam (a htslib interface wrapper for Python). For a header, I prepended the original superset BAM’s header to the result by setting the template parameter to the new AlignmentFile constructor to the name of the open “super” AlignmentFile:

import pysam

# Open original "super" BAM containing all reads
super_bam = pysam.AlignmentFile("/path/to/my.bam")

# Open a new BAM for writing, using the old header
INTERESTING_CONTIG = "my_contig"
contig_bam = pysam.AlignmentFile(INTERESTING_CONTIG+".lazy.bam", "wb", template=super_bam)

# Write reads on the target contig to new file
for read in super_bam.fetch(INTERESTING_CONTIG):
    contig_bam.write(read)

# Housekeeping
super_bam.close()
contig_bam.close()

import pysam

# Open original "super" BAM containing all reads

super_bam = pysam.AlignmentFile("/path/to/my.bam")

# Open a new BAM for writing, using the old header

INTERESTING_CONTIG = "my_contig"

contig_bam = pysam.AlignmentFile(INTERESTING_CONTIG+".lazy.bam", "wb", template=super_bam)

# Write reads on the target contig to new file

for read in super_bam.fetch(INTERESTING_CONTIG):

contig_bam.write(read)

# Housekeeping

super_bam.close()

contig_bam.close()

Although the script had extracted a subset of reads on a given contig as desired, I found downstream that GATK was wasting resources – or more specifically, many hours of my cluster time – processing the hundreds of thousands of other contigs (@SQ lines) listed in the header, despite there being no reads on those contigs.

INFO  10:28:50,300 GenomeAnalysisEngine - Preparing for traversal over 1 BAM files 
INFO  10:28:51,282 GenomeAnalysisEngine - Done preparing for traversal 
INFO  10:28:51,283 ProgressMeter - [INITIALIZATION COMPLETE; STARTING PROCESSING] 
INFO  10:28:51,285 ProgressMeter -                 | processed |    time |    per 1M |           |   total | remaining 
INFO  10:28:51,285 ProgressMeter -        Location |     sites | elapsed |     sites | completed | runtime |   runtime 
INFO  10:29:21,290 ProgressMeter - NODE_558_length_1188_cov_10.499158:1201     83534.0    30.0 s       6.0 m        0.0%    42.6 h      42.6 h 
INFO  10:30:21,293 ProgressMeter - NODE_1558_length_375_cov_11.621333:401    243207.0    90.0 s       6.2 m        0.1%    44.4 h      44.4 h 
INFO  10:31:21,295 ProgressMeter - NODE_2578_length_1249_cov_7.622097:1201    379570.0     2.5 m       6.6 m        0.1%    47.4 h      47.3 h
[...]
INFO  10:22:06,562 ProgressMeter - NODE_1498866_length_119_cov_4.596639:101   4.32648722E8    47.9 h       6.6 m       99.9%    47.9 h       2.6 m 
INFO  10:23:06,574 ProgressMeter - NODE_1500560_length_51_cov_4.470588:101   4.32851674E8    47.9 h       6.6 m      100.0%    47.9 h      77.0 s 
INFO  10:24:06,584 ProgressMeter - NODE_1502759_length_114_cov_12.587719:101   4.33032727E8    47.9 h       6.6 m      100.0%    47.9 h       5.0 s 
INFO  10:24:11,379 ProgressMeter -            done   4.3304713E8    47.9 h       6.6 m      100.0%    47.9 h       0.0 s 
INFO  10:24:11,381 ProgressMeter - Total runtime 172520.10 secs, 2875.33 min, 47.92 hours

INFO 10:28:50,300 GenomeAnalysisEngine - Preparing for traversal over 1 BAM files

INFO 10:28:51,282 GenomeAnalysisEngine - Done preparing for traversal

INFO 10:28:51,283 ProgressMeter - [INITIALIZATION COMPLETE; STARTING PROCESSING]

INFO 10:29:21,290 ProgressMeter - NODE_558_length_1188_cov_10.499158:1201 83534.0 30.0 s 6.0 m 0.0% 42.6 h 42.6 h

INFO 10:30:21,293 ProgressMeter - NODE_1558_length_375_cov_11.621333:401 243207.0 90.0 s 6.2 m 0.1% 44.4 h 44.4 h

INFO 10:31:21,295 ProgressMeter - NODE_2578_length_1249_cov_7.622097:1201 379570.0 2.5 m 6.6 m 0.1% 47.4 h 47.3 h

[...]

INFO 10:22:06,562 ProgressMeter - NODE_1498866_length_119_cov_4.596639:101 4.32648722E8 47.9 h 6.6 m 99.9% 47.9 h 2.6 m

INFO 10:23:06,574 ProgressMeter - NODE_1500560_length_51_cov_4.470588:101 4.32851674E8 47.9 h 6.6 m 100.0% 47.9 h 77.0 s

INFO 10:24:06,584 ProgressMeter - NODE_1502759_length_114_cov_12.587719:101 4.33032727E8 47.9 h 6.6 m 100.0% 47.9 h 5.0 s

INFO 10:24:11,379 ProgressMeter - done 4.3304713E8 47.9 h 6.6 m 100.0% 47.9 h 0.0 s

INFO 10:24:11,381 ProgressMeter - Total runtime 172520.10 secs, 2875.33 min, 47.92 hours

Indeed, for the unsure, we can confirm that a small subset of reads were successfully extracted, but the entire header remains.

$ samtools view -c Limpet-Magda.sorted.bam # The super BAM
365107681

$ samtools view -c NODE_912989_length_238_cov_5.743698.lazy.bam # The new subset BAM
98

$ samtools view -H NODE_912989_length_238_cov_5.743698.lazy.bam | grep -c "^@SQ"
730724

$ samtools view -c Limpet-Magda.sorted.bam # The super BAM

365107681

$ samtools view -c NODE_912989_length_238_cov_5.743698.lazy.bam # The new subset BAM

$ samtools view -H NODE_912989_length_238_cov_5.743698.lazy.bam | grep -c "^@SQ"

730724

This amortizes to around one read per half hour, at which rate I could probably have done the job myself by hand. Evidently, we’d need to provide a smaller header of our own.

Invalidating reads with improper mates

I went back to my pysam script and stripped out all sequence (@SQ) lines from the resulting header that did not match the single contig of interest, taking care to now set the reference_id and next_reference_id (the read mate) of each read to 0: the first and only @SQ line in the new header, our target contig. For reads on the target contig, whose mate was mapped elsewhere, I updated the reference_id to -1: i.e. unmapped. This happened to cause unexpected behaviour downstream, in that I was not expecting everything to be broken:

Exception in thread "main" htsjdk.samtools.SAMFormatException:
    SAM validation error: ERROR: Record 37, Read name <READ NAME>, Mate Alignment start should be 0 because reference name = *.

1 2	Exception in thread "main" htsjdk.samtools.SAMFormatException: SAM validation error: ERROR: Record 37, Read name <READ NAME>, Mate Alignment start should be 0 because reference name = *.

It wouldn’t be until later whilst investigating issues with another tool that I would discover how to correctly update the bit flags and read attributes to mark reads as unmapped as per the BAM specification. But in this instance, the error made me question whether I really wanted to dispose of the information held by reads whose mate appeared on another contig. Figuring this could come in handy later for scaffolding (or just satisfying my curiosity), I needed to find another way to subset the BAM.

Attempting to read a `reference_id` greater than the number of SQ lines unsurprisingly causes samtools segmentation fault

I returned to my hacky script once more. This time, my header was constructed such that it would contain @SQ sequence lines for not only the target contig, but any contig for which reads on the target contig have a mate appearing on too. I did this by discarding the sequence lines that were neither the target contig, or a mate to any reads on the target contig:

import pysam

super_bam = pysam.AlignmentFile("/path/to/my.bam")

# Define target contig and fetch its index in the SQ lines
INTERESTING_CONTIG = "my_contig"
INTERESTING_INDEX = super_bam.references.keys().index(INTERESTING_CONTIG)

# Copy the header but truncate the existing SQ lines
header = super_bam.header.copy()
header["SQ"] = []

# Keep a set of required SQ lines indices
# and add the SQ index of the target contig
required_indices = set()
required_indices.add(INTERESTING_INDEX)

# Parse the reads mapped to the target contig and add
# the index of the contig the mate pair appears on
# (if not the target contig, or unmapped)
for read in super_bam.fetch(INTERESTING_CONTIG):
    if read.next_reference_id > 0:
        required_indices.add(read.next_reference_id)

# Populate the new header's SQ lines, extracting the
# original SQ data from the super_bam.header for each
# index harvested from the previous step
for ref_index in required_indices:
    header["SQ"].append( super_bam.header["SQ"][ref_index] )

# Open a new BAM for writing, with your new header
contig_bam = pysam.AlignmentFile("my_new.bam", "wb", header=header)

import pysam

super_bam = pysam.AlignmentFile("/path/to/my.bam")

# Define target contig and fetch its index in the SQ lines

INTERESTING_CONTIG = "my_contig"

INTERESTING_INDEX = super_bam.references.keys().index(INTERESTING_CONTIG)

# Copy the header but truncate the existing SQ lines

header = super_bam.header.copy()

header["SQ"] = []

# Keep a set of required SQ lines indices

# and add the SQ index of the target contig

required_indices = set()

required_indices.add(INTERESTING_INDEX)

# Parse the reads mapped to the target contig and add

# the index of the contig the mate pair appears on

# (if not the target contig, or unmapped)

for read in super_bam.fetch(INTERESTING_CONTIG):

if read.next_reference_id > 0:

required_indices.add(read.next_reference_id)

# Populate the new header's SQ lines, extracting the

# original SQ data from the super_bam.header for each

# index harvested from the previous step

for ref_index in required_indices:

header["SQ"].append( super_bam.header["SQ"][ref_index] )

# Open a new BAM for writing, with your new header

contig_bam = pysam.AlignmentFile("my_new.bam", "wb", header=header)

This however displeased SAMtools greatly:

$ samtools index NODE_912989_length_238_cov_5.743698.with_non-seq_header.bam 
Segmentation fault (core dumped)

$ samtools view -H NODE_912989_length_238_cov_5.743698.with_non-seq_header.bam 
@HD     VN:1.0  SO:coordinate
@SQ     SN:NODE_539672_length_126_cov_5.206349  LN:176
@SQ     SN:NODE_837244_length_378_cov_5.251323  LN:428
@SQ     SN:NODE_912989_length_238_cov_5.743698  LN:288
@SQ     SN:NODE_1101582_length_123_cov_4.081301 LN:173
@SQ     SN:NODE_1140726_length_2383_cov_9.494335        LN:2433
@RG     ID:Limpet-Magda SM:Limpet-Magda PU:Illumina     PL:Illumina
@PG     PN:bowtie2      ID:bowtie2      VN:2.2.3        CL:"[...]"

$ samtools view NODE_912989_length_238_cov_5.743698.with_non-seq_header.bam 
[main_samview] truncated file.

$ samtools index NODE_912989_length_238_cov_5.743698.with_non-seq_header.bam

Segmentation fault (core dumped)

$ samtools view -H NODE_912989_length_238_cov_5.743698.with_non-seq_header.bam

@HD VN:1.0 SO:coordinate

@SQ SN:NODE_539672_length_126_cov_5.206349 LN:176

@SQ SN:NODE_837244_length_378_cov_5.251323 LN:428

@SQ SN:NODE_912989_length_238_cov_5.743698 LN:288

@SQ SN:NODE_1101582_length_123_cov_4.081301 LN:173

@SQ SN:NODE_1140726_length_2383_cov_9.494335 LN:2433

@RG ID:Limpet-Magda SM:Limpet-Magda PU:Illumina PL:Illumina

@PG PN:bowtie2 ID:bowtie2 VN:2.2.3 CL:"[...]"

$ samtools view NODE_912989_length_238_cov_5.743698.with_non-seq_header.bam

[main_samview] truncated file.

As I’d merely re-written the header, keeping each read’s reference_id and next_reference_id intact, I’d inadvertently created an invalid BAM file which causes samtools to seg fault when trying to parse it with samtools view or samtools index. Without getting too technical, samtools expects the length of the list of @SQ lines to equal the index of the largest @SQ line, i.e. the @SQ lines are consecutively numbered¹. Values for both the reference_id and next_reference_id for each read are used by samtools not to refer to the @SQ line with some ID i, but rather the i‘th @SQ line in the list of sequences. This is an important distinction, as having filtered out the majority of sequence lines (the example above contains just 5 of the ~730K original @SQ lines in the superset BAM), I had disturbed the numbering scheme, worse still, I’d made it almost certain that an error would occur when trying to read any file created in the same way.

In the above example, the contig of interest is NODE_912989_length_238_cov_5.743698, whose corresponding reads have a reference_id of 421586. This is not the @SQ line with ID 421586, but the 421586’th sequence in the list of all @SQ lines. Yet as the subset BAM’s first @SQ line, it is addressed as the 0’th sequence in the structure built by samtools during parsing. Later, when attempting to output information on the reads contained in the file, the reference_id of 421586 causes samtools to attempt to access invalid memory — the 421586’th element of a struct with only 5 entries.

samtools elegantly handles my stupidity by segfaulting.

Unordered sequence header causes huge GATK errors

Hacking on a hack, I simply re-numbered the reference_id and next_reference_id attributes of appropriate reads with consecutive integers to match their new @SQ lines. I appended the target contig to the sequence header first and translated IDs for corresponding reads to 0 as I had done earlier. When unseen contigs with mates to a target contig read were encountered, the contig was also appended to the new header and the next_reference_id was overwritten with a new incremental ID:

import pysam

# Load superset BAM
super_bam = pysam.AlignmentFile("/path/to/my.bam")

# Construct an index reference
reference_index = {}
for i, reference in enumerate(super_bam.references):
    reference_index[reference] = i

# Define target contig and fetch its index in the SQ lines
INTERESTING_CONTIG = "my_contig"
INTERESTING_INDEX = references_index[INTERESTING_CONTIG]

# Copy the header but truncate the existing SQ lines
header = super_bam.header.copy()
header["SQ"] = []

# Maintain a map of sequence index translations
#  The key maps the old @SQ line index to the new
#  @SQ line index value for the subset header
header_sq_map = {} 

# As before, keep a set of required SQ lines indices
# We can append the target contig to the header
# and also per-emptively translate it to the 0th SQ
required_indices = set()
header["SQ"].append( super_bam.header["SQ"][INTERESTING_INDEX] )
header_sq_map[INTERESTING_INDEX] = 0

# Parse the reads mapped to the target contig and add
# the index of the contig the mate pair appears on
# (if not the target contig, or unmapped)
for read in super_bam.fetch(INTERESTING_CONTIG):
    if read.next_reference_id > 0:
        required_indices.add(read.next_reference_id)

# Populate the new header's SQ lines, extracting the
# original SQ data from the super_bam.header for each
# index harvested from the previous step
# This time, we also add an entry in header_sq_map for
# translation on our second loop over the reads
for i, ref_index in enumerate(required_indices):
    header["SQ"].append( super_bam.header["SQ"][ref_index] )
    header_sq_map[ref_index] = i

# Open a new BAM for writing, with your new header
contig_bam = pysam.AlignmentFile("my_new.bam", "wb", header=header)

# Fetch, translate and write reads on the target contig to new BAM
for read in super_bam.fetch(INTERESTING_CONTIG):
    read.reference_id = header_sq_map[read.reference_id]
    read.next_reference_id = header_sq_map[read.next_reference_id]
    contig_bam.write(read)

contig_bam.close()
super_bam.close()

import pysam

# Load superset BAM

super_bam = pysam.AlignmentFile("/path/to/my.bam")

# Construct an index reference

reference_index = {}

for i, reference in enumerate(super_bam.references):

reference_index[reference] = i

# Define target contig and fetch its index in the SQ lines

INTERESTING_CONTIG = "my_contig"

INTERESTING_INDEX = references_index[INTERESTING_CONTIG]

# Copy the header but truncate the existing SQ lines

header = super_bam.header.copy()

header["SQ"] = []

# Maintain a map of sequence index translations

# The key maps the old @SQ line index to the new

# @SQ line index value for the subset header

header_sq_map = {}

# As before, keep a set of required SQ lines indices

# We can append the target contig to the header

# and also per-emptively translate it to the 0th SQ

required_indices = set()

header["SQ"].append( super_bam.header["SQ"][INTERESTING_INDEX] )

header_sq_map[INTERESTING_INDEX] = 0

# Parse the reads mapped to the target contig and add

# the index of the contig the mate pair appears on

# (if not the target contig, or unmapped)

for read in super_bam.fetch(INTERESTING_CONTIG):

if read.next_reference_id > 0:

required_indices.add(read.next_reference_id)

# Populate the new header's SQ lines, extracting the

# original SQ data from the super_bam.header for each

# index harvested from the previous step

# This time, we also add an entry in header_sq_map for

# translation on our second loop over the reads

for i, ref_index in enumerate(required_indices):

header["SQ"].append( super_bam.header["SQ"][ref_index] )

header_sq_map[ref_index] = i

# Open a new BAM for writing, with your new header

contig_bam = pysam.AlignmentFile("my_new.bam", "wb", header=header)

# Fetch, translate and write reads on the target contig to new BAM

for read in super_bam.fetch(INTERESTING_CONTIG):

read.reference_id = header_sq_map[read.reference_id]

read.next_reference_id = header_sq_map[read.next_reference_id]

contig_bam.write(read)

contig_bam.close()

super_bam.close()

This didn’t appear to break samtools as before and after a quick trip through Picard’s MarkDuplicates I packed off 250 subset BAMs on an adventure through the GATK best practice pipeline. The trip abruptly cut short and I was left with a directory containing over 5GB of error logs:

Input files reads and reference have incompatible contigs: The contig order in reads and reference is not the same; to fix this please see: (https://www.broadinstitute.org/gatk/guide/article?id=1328), which describes reordering contigs in BAM and VCF files..

The error helpfully went on to list each of the contigs in the current BAM, along with all ~730K contigs found in the reference FASTA, by name. It appeared GATK did not approve of my somewhat haphazard appending-as-first-encountered approach to the reads-with-mates-not-on-target problem. It appears that the order of the @SQ lines must match that of the appearance of the contigs themselves in the reference FASTA. Presumably this is to ensure quick and easy mapping between the @SQ lines in the BAM and the entries of the reference FASTA index and dictionary.

`ReorderSam` assumes the header is supposed to contain all contigs found in the reference

As appears to be the norm, the GATK error text helpfully links a how-to article that may be of use and notes that the Picard toolkit offers a handy ReorderSam command that is capable of sorting @SQ lines to match the order in which contigs appear in a given reference FASTA, updating the reference IDs of reads and their mates as appropriate. Once again, invocation was simple:

java -jar picard.jar ReorderSam I=<subset.bam> R=<ref.fa> O=<subset.sq_reordered.bam>

1	java -jar picard.jar ReorderSam I=<subset.bam> R=<ref.fa> O=<subset.sq_reordered.bam>

But in bioinformatics simple problems rarely have simple solutions² and ReorderSam had basically reinstated the original superset BAM header:

$ samtools view -H test.bam | grep -c "^@SQ"
49
$ samtools view -H test.sq_reordered.bam | grep -c "^@SQ"
730724

$ samtools view -H test.bam | grep -c "^@SQ"

$ samtools view -H test.sq_reordered.bam | grep -c "^@SQ"

730724

Whilst ReorderSam does indeed perform some re-ordering, I feel it is somewhat of a misnomer and perhaps ReconcileSamRef³ is a more fitting name for the tool. Evidently the tool is primarily used under the assumption that both the input BAM and reference have the same set of contigs, where one may be ordered lexicographically and the other by karyotype. Unfortunately, neither of the two boolean options that can be specified to ReorderSam had the functionality I needed, though one (ALLOW_INCOMPLETE_DICT_CONCORDANCE=True) performed the exact opposite; dropping @SQ lines from the source BAM if they did not appear in the reference. But we’ll return to this option.

Sorted but not solved: GATK `IndelRealigner` upset by `@SQ` lines not matching reference FASTA after all

We can solve the out of order problem rather trivially:

[...]

required_indices = set()
required_indices.add(INTERESTING_INDEX)

# As before parse the reads mapped to the target contig and
# add the index of the contig the mate pair appears on
# (if not the target contig, or unmapped)
for read in super_bam.fetch(INTERESTING_CONTIG):
    if read.next_reference_id > 0:
        required_indices.add(read.next_reference_id)

# Populate the new header's SQ lines, extracting the
# original SQ data from the super_bam.header for each
# index harvested from the previous step
# This time, we also add an entry in header_sq_map for
# translation on our second loop over the reads
#   Note the sort applied to the required_indices set
#   that solves the issue with @SQ lines appearing
#   out of order against the reference FASTA
for i, ref_index in enumerate(sorted(required_indices)):
    header["SQ"].append( super_bam.header["SQ"][ref_index] )
    header_sq_map[ref_index] = i

[...]

[...]

required_indices = set()

required_indices.add(INTERESTING_INDEX)

# As before parse the reads mapped to the target contig and

# add the index of the contig the mate pair appears on

# (if not the target contig, or unmapped)

for read in super_bam.fetch(INTERESTING_CONTIG):

if read.next_reference_id > 0:

required_indices.add(read.next_reference_id)

# Populate the new header's SQ lines, extracting the

# original SQ data from the super_bam.header for each

# index harvested from the previous step

# This time, we also add an entry in header_sq_map for

# translation on our second loop over the reads

# Note the sort applied to the required_indices set

# that solves the issue with @SQ lines appearing

# out of order against the reference FASTA

for i, ref_index in enumerate(sorted(required_indices)):

header["SQ"].append( super_bam.header["SQ"][ref_index] )

header_sq_map[ref_index] = i

[...]

As we’ve been collecting the indices (that is, the entry at which a contig’s name appears in the @SQ header) all along, we can just use Python’s sorted built-in on the required_indices set. The ensures entries to the header_sq_map dictionary later used to reassign the reference_id and next_reference_id attributes of reads are created with incrementally assigned values in the order that those sequences should appear in the finished header. Tada!

One might have expected that to be the end of things, but whilst these files are now somewhat valid (and can be processed by Picard’s MarkDuplicates and GATK’s RealignerTargetCreator tools), they will typically fail a trip through Picard’s ValidateSamFile. This is a side-effect of our only interest being in reads that appear on the target contig. Although we retain information about mate pairs, including those on other contigs (whose sequence headers are also now included in the @SQ header), we discard the mates themselves along with any other read that does not fall on the specific contig that we target. Indeed, ValidateSamFile raises errors for each of these missing mates:

ERROR: Read name <RNAME>, Mate not found for paired read

Oddly, when I attempted to check whether these reads would fall foul of the infamous MalformedReadFilter with PrintReads (don’t forget, PrintReads automatically applies MalformedReadFilter) a completely different error surfaced⁴:

Badly formed genome loc: Parameters to GenomeLocParser are incorrect: The contig index <x> is bad, doesn’t equal the contig index <y> of the contig from a string <contig>

Busted. Although I’ve successfully ordered the subset of contigs to reflect the order in which they appear in the reference, appeasing both MarkDuplicates and RealignerTargetCreator, there’s no pulling the wool over the eyes of PrintReads. But as it turns out, PrintReads isn’t the only tool in the kit that is capable of seeing through our fraudulent activity. Given that RealignerTargetCreator completes successfully, one would naturally run the next step in the best practice pipeline: the IndelRealigner, which gives exactly the same error.

Same error, different walker: GATK `HaplotypeCaller` also upset by `@SQ` lines not matching reference FASTA verbatim

So what if we were naughty? What if we just want this saga to be over and decide to throw best practice to the wind? We could just skip indel realignment entirely and jump straight to haplotype calling, right? Sadly, GATK has you cornered now. Invoking the HaplotypeCaller with a file treated with our tool yields yet another error:

Rod span <contig>:<pos> isn’t contained within the data shard <contig>:<pos>, meaning we wouldn’t get all of the data we need

On the surface, this error doesn’t appear to give much away. The contig and position that could not be found is repeated twice in the message, but I guess the confusion comes from the jargon and the error boils down to something a pretty simple:

Hey, I looked for contig:pos where contig:pos should be according to the reference and I could not find them, so I don’t have the data I need to do the stuff you told me to do. So, I’m going now. Bye.

Yeah, it’s HaplotypeCaller telling us the same thing as PrintReads and IndelRealigner. Nobody wants our shoddily manufactured BAM file, it violates some underlying assumption that every @SQ line in the header should appear consecutively in the same order as they do in the reference FASTA (and by extension, the reference dictionary and index). Despite our best attempt to renumber the reference_id and next_reference_id attributes of the reads themselves to match a new ordering of just a subset of those @SQ lines, there appears to be no getting around this implicit requirement that the header and reference map 1:1.

I guess this is for the same reason that GATK requires a .dict and .fai file for references as I’ve discussed before, it just makes things a little easier for developers (and their code). In this case, the assumption that each contig reference has a bijective mapping between the BAM header, reference index and reference dictionary means that look ups can simply rely on contig indices: i.e. the i’th @SQ line will also be the i’th entry of the reference index and dictionary.

So, this has been a great exercise in learning more about the BAM specification, pysam and the excessively orderly nature of the GATK, but how are we supposed to correctly subset a BAM? Surely there must be an easier way than all of this?

I downed tools, and did what I should have done much earlier, I read the manual.

How to correctly subset a BAM for analysis

Who’d have thought, wanting to perform analysis on subsets of BAMs is actually quite a common use case that the lovely GATK folks have already considered? It turns out that “subsetting” was perhaps not the keyword to be looking for, but rather “intervals”. In fact a simple search immediately yields a helpful GATK article on when to use interval lists for processing and the GATK command line reference describes the -L or --intervals argument that is accepted by many of the tools to support performing operations on specific parts (or intervals) of the BAM. The -L argument even crops up in the very same pre-processing best practice documents that I was purportedly following for indel realignment:

[RealignerTargetCreator will be] Faster since RTC will only look for regions that need to be realigned within the input interval; no time wasted on the rest.

Sure enough, I can just append the -L argument with the name of my target contig (as it appears in the @SQ header and reference) as a parameter to many of the tools provided by GATK. -L can also be specified multiple times, or even just reference a text file of intervals, too:

java -jar GenomeAnalysisTK.jar -T RealignerTargetCreator \
                               -R <REFERENCE FASTA> \
                               -I <INPUT BAM> \
                               -o <OUTPUT LIST> \
                               -L <CONTIG_1> [-L <CONTIG_2> ... -L <CONTIG_N>]

java -jar GenomeAnalysisTK.jar -T RealignerTargetCreator \

-R <REFERENCE FASTA> \

-I <INPUT BAM> \

-o <OUTPUT LIST> \

-L <CONTIG_1> [-L <CONTIG_2> ... -L <CONTIG_N>]

Re-running my example from earlier, specifying -L NODE_912989_length_238_cov_5.743698 causes RealignerTargetCreator to run in a matter of minutes instead of almost two days (the actual processing is actually completed in less than a second according to the log below), with an input BAM of over 30GB. I should add that this handy option doesn’t seem to decrease the amount of memory required as the re-run still munched on 32.4GB of RAM — but I guess that’s little to worry about if the job completes in less than five minutes:

INFO  01:02:57,559 SAMDataSource$SAMReaders - Initializing SAMRecords in serial 
INFO  01:04:24,346 SAMDataSource$SAMReaders - Done initializing BAM readers: total time 86.76 
INFO  01:04:24,464 IntervalUtils - Processing 288 bp from intervals 
INFO  01:04:28,548 GenomeAnalysisEngine - Preparing for traversal over 1 BAM files 
INFO  01:04:28,801 GenomeAnalysisEngine - Done preparing for traversal 
INFO  01:04:28,803 ProgressMeter - [INITIALIZATION COMPLETE; STARTING PROCESSING] 
INFO  01:04:28,804 ProgressMeter -                 | processed |    time |    per 1M |           |   total | remaining 
INFO  01:04:28,805 ProgressMeter -        Location |     sites | elapsed |     sites | completed | runtime |   runtime 
INFO  01:04:29,046 ProgressMeter -            done       288.0     0.0 s      14.1 m       69.4%     0.0 s       0.0 s 
INFO  01:04:29,049 ProgressMeter - Total runtime 0.25 secs, 0.00 min, 0.00 hours

INFO 01:02:57,559 SAMDataSource$SAMReaders - Initializing SAMRecords in serial

INFO 01:04:24,346 SAMDataSource$SAMReaders - Done initializing BAM readers: total time 86.76

INFO 01:04:24,464 IntervalUtils - Processing 288 bp from intervals

INFO 01:04:28,548 GenomeAnalysisEngine - Preparing for traversal over 1 BAM files

INFO 01:04:28,801 GenomeAnalysisEngine - Done preparing for traversal

INFO 01:04:28,803 ProgressMeter - [INITIALIZATION COMPLETE; STARTING PROCESSING]

INFO 01:04:29,046 ProgressMeter - done 288.0 0.0 s 14.1 m 69.4% 0.0 s 0.0 s

INFO 01:04:29,049 ProgressMeter - Total runtime 0.25 secs, 0.00 min, 0.00 hours

Excellent, I’m sure both my supervisor and system administrator will be pleased.

What about extraction?

It’s all well and good that we can concentrate processing on interesting contigs like this, but what if we reeeaally want to extract and store some reads for a specific contig like we have been trying, can we do it?

Sadly, it seems we’re fresh out of luck. We can abuse PrintReads to parse and write a new BAM, appending the -L argument to our command which will have the side effect of dropping reads that don’t fall on the contig(s) specified. As one would have expected, the output BAM is significantly smaller and demonstrates the correct number of reads (or at least, the read count matches that in the BAM we made for ourselves), so what’s the problem?

$ java -jar GenomeAnalysisTK.jar -T PrintReads \
                                 -R contigs.fa \
                                 -I Limpet-Magda.sorted.bam \
                                 -o NODE_912989_length_238_cov_5.743698.subset.print.bam \
                                 -L NODE_912989_length_238_cov_5.743698

$ samtools view -c NODE_912989_length_238_cov_5.743698.subset.print.bam
98

$ samtools view -H NODE_912989_length_238_cov_5.743698.subset.print.bam | grep -c "^@SQ"
730724

$ java -jar GenomeAnalysisTK.jar -T PrintReads \

-R contigs.fa \

-I Limpet-Magda.sorted.bam \

-o NODE_912989_length_238_cov_5.743698.subset.print.bam \

-L NODE_912989_length_238_cov_5.743698

$ samtools view -c NODE_912989_length_238_cov_5.743698.subset.print.bam

$ samtools view -H NODE_912989_length_238_cov_5.743698.subset.print.bam | grep -c "^@SQ"

730724

We’re back where we started with my own tool, a BAM with the right number of reads but a fully intact header that causes wasted resources. We’ve reached the crux. Unsurprisingly, for the reasons I’ve hypothesized, we just can’t be messing around with the @SQ header if we still want to use the same reference as that used with the super BAM. I briefly toyed with the thought of generating subsets of the reference FASTA itself, to match the new @SQ of each subset BAM. This would definitely appease the tools upset by our trickery, but we’d need to also generate FASTA indexes and dictionaries for each new reference and ensure to provide the right sub-reference for each sub-BAM when conducting analysis later. My bioinformaticy senses tingled, this sounds messy; a sticky plaster over a sticky plaster. I could already see another addendum to a long future blog post forming.

For the time being, I’d achieved what I needed to do, at least in part. I’ve discovered how to focus efforts on specific intervals of interest with the -L argument, saving computational resources along with my own time and sanity. I can now get on with following the GATK best practice pipeline, and if I do encounter a use case that necessitates extraction of reads in the sense of what I initially set out to do, I can spin out a tool to just regenerate a new reference FASTA, dictionary and index[^8], as messy as that may sound.

Though, before you leave here with the conclusion that I can’t even read, I should perhaps jump in to my own defence a little. The reason that I didn’t just set out to operate on a subset (interval!) of the alignment, was in trying to avoid having to define subregions at every step of the analysis pipeline. Although primarily out of laze, the idea was to also avoid having to store all of the reads that weren’t of interest to me in the first place, don’t forget, for our cluster disk space is as much of a scarce commodity as RAM. I also wanted small BAMs (10-100Mb) that could be effortlessly transmitted to others without worrying about bandwidth, hosting or having to offer aftercare to people trapped underneath 365 million reads. Really, I just wanted to quickly and crudely look at some data for myself and I thought it would be easy to roll something small to do the trick with pysam[^7].

~~But I learned my lesson.~~

Update: The following evening

For what it’s worth, the GATK developers got in touch and shared an article describing the generation of example files that contain a subset of reads for a workshop. The tutorial suggests that as per my earlier suspicions, the best way to achieve extraction is to build a subset reference too. Interestingly, extraction and indexing of single contigs from a FASTA can both be done with samtools faidx, which I didn’t realise. The process overall is a little convoluted, for example the BAM header must be extracted (samtools view -H) and manually edited to prune @SQ lines, the BAM must then be converted to SAM to allow Picard ReheaderSam to apply the modified header (and back again later). As with my own example earlier, this process will still leave reads without a mate if the mate appears on a contig which has been filtered out. However, the tutorial does offer a solution to this in the form of Picard’s RevertSam tool, whose (albeit quite destructive) SANITIZE option will forcibly discard reads that cause the SAM to be invalid.

Update: If you aren’t lazy

If you’re happy to make a new reference and just want to extract a bunch of reads for a particular contig, you’re in luck. You can extract the contig with samtools faidx and use ReorderSam‘s ALLOW_INCOMPLETE_DICT_CONCORDANCE option (S=true shorthand) to forcibly drop reads that don’t appear in your new reference. Ta-da!

samtools faidx ref.fa my_contig_name > my_contig.fa
java -jar picard.jar CreateSequenceDictionary R=my_contig.fa O=my_contig.dict
java -jar picard.jar ReorderSam INPUT=super.bam OUTPUT=subset.bam REFERENCE=my_contig.fa S=true VERBOSITY=WARNING

samtools faidx ref.fa my_contig_name > my_contig.fa

java -jar picard.jar CreateSequenceDictionary R=my_contig.fa O=my_contig.dict

java -jar picard.jar ReorderSam INPUT=super.bam OUTPUT=subset.bam REFERENCE=my_contig.fa S=true VERBOSITY=WARNING

Set the verbosity to WARNING to avoid thousands of INFO lines telling you about dropped contigs, and don’t forget to make your sequence dictionary first! Happy subsetting!

tl;dr

Although ReorderSam does perform re-ordering, its name does not communicate its assumption that both the input BAM and reference FASTA share the same set of contigs, that just happen to be ordered differently
You simply cannot chop out swathes of the @SQ header, no matter how well you cover up your tracks
GATK insists you stop mucking about with the BAM header, consider them contaminated as soon as you touch them with your careless fingers
Use the -L parameter to use a GATK tool on a subset of reads in a large BAM
samtools faidx can also extract a contig from a FASTA
Picard RevertSam‘s SANITIZE option can be used to discard reads missing mates (amongst many other things)
Seriously, stop trying to do weird things with BAMs by yourself
But you could use ReorderSam with S=true if you are happy to make a new reference.

$ time java -Xmx3G -jar ~/ware/GenomeAnalysisTK.jar -T PrintReads -I NODE_912989_length_238_cov_5.743698.good.bam -R contigs.fa
[...]
118.96user 2.49system 1:21.35elapsed 149%CPU (0avgtext+0avgdata 3172152maxresident)k

$ time java -Xmx3G -jar ~/ware/GenomeAnalysisTK.jar -T PrintReads -I NODE_912989_length_238_cov_5.743698.good.bam -R contigs.fa

[...]

118.96user 2.49system 1:21.35elapsed 149%CPU (0avgtext+0avgdata 3172152maxresident)k

Although this is somewhat of a tautology; the list’s length is expected to be equal to the greatest ID of the list, where the “ID” of a sequence directly corresponds to that sequence’s position in the list of all @SQ lines, hopefully it helps to explain why removing elements from that list was silly. ↩
This is especially true when you are doing it wrong™. ↩
Narrowly beating my other suggestion of HarmonizeSamHeader. ↩
Incidentally, for a 8.1Kb BAM and a 441Mb contig FASTA, I needed to set the Java virtual machine heap to 3GB. For the record I ran the command with time⁵, note the seemingly insane use of 3172Mb of resident memory for what seems to be a trivial job on the surface. ↩
↩

Wasting time with RealignerTargetCreator, large SQ headers and sparse BAMs

Invalidating reads with improper mates

Attempting to read a reference_id greater than the number of SQ lines unsurprisingly causes samtools segmentation fault

Unordered sequence header causes huge GATK errors

ReorderSam assumes the header is supposed to contain all contigs found in the reference

Sorted but not solved: GATK IndelRealigner upset by @SQ lines not matching reference FASTA after all

Same error, different walker: GATK HaplotypeCaller also upset by @SQ lines not matching reference FASTA verbatim

How to correctly subset a BAM for analysis

What about extraction?

Update: The following evening

Update: If you aren’t lazy

tl;dr

Wasting time with `RealignerTargetCreator`, large SQ headers and sparse BAMs

Attempting to read a `reference_id` greater than the number of SQ lines unsurprisingly causes samtools segmentation fault

`ReorderSam` assumes the header is supposed to contain all contigs found in the reference

Sorted but not solved: GATK `IndelRealigner` upset by `@SQ` lines not matching reference FASTA after all

Same error, different walker: GATK `HaplotypeCaller` also upset by `@SQ` lines not matching reference FASTA verbatim