Status Report: November 2016 (Part II): Revisiting the Synthetic Metahaplomes

   Sam Nicholls    No Comments yet    AU-PhD, Status Report

In the opening act of this status report, I described the abrupt and unfortunate end to the adventure of my pre-print. In response to reviewer feedback, I outlined three major tasks that lie ahead in wait for me, blocking the path to an enjoyable Christmas holiday, and a better manuscript submission to a more specialised an alternative journal:

  • Improve Triviomes
    We are already doing something interesting and novel, but the “triviomes” are evidently convoluting the explanation. We need something with more biological grounding such that we don’t need to spend many paragraphs explaining why we’ve made certain simplifications, or cause readers to question why we are doing things in a particular way. Note this new method will still need to give us a controlled environment to test the limitations of Hansel and Gretel.
  • Polish DHFR and AIMP1 analysis
    One of our reviewers misinterpreted some of the results, and drew a negative conclusion about Gretel‘s overall accuracy. I’d like to revisit the DHFR and AIMP1 data sets to both improve the story we tell, but also to describe in more detail (with more experiments) under what conditions we can and cannot recover haplotypes accurately.

  • Real Reads
    Create and analyse a data set consisting of real reads.

The first part of this report covers my vented frustrations with both generating, but particularly analysing the new “Treeviomes” in depth. This part will now turn focus to the sequel of my already existing DHFR analysis, addressing the reasoning behind a particularly crap set of recoveries and reassuring myself that perhaps everything is not a disaster after all.

Friends and family, you might be happy to know that I am less grumpy than last week, as a result of the work described in this post. You can skip to the tldr now.

DHFR II: Electric Boogaloo


Our preprint builds upon the initial naive testing of Hansel and Gretel on recovery of randomly generated haplotypes, by introducing a pair of metahaplomes that each contain five haplotypes of varying similarity of a particular gene; namely DHFR and AIMP1.

Testing follows the same formula as we’ve discussed in a previous blog post:

  • Select a master gene (an arbitrary, DHFR, in this case)
  • megaBLAST for similar sequences and select 5 arbitrary sequences of decreasing identity
  • Extract the overlapping regions on those sequences, these are the input haplotypes
  • Generate a random set of (now properly) uniformly distributed reads, of a fixed length and coverage, from each of the input haplotypes
  • Use the master as a pseudo-reference against which to align your synthetic reads
  • Stand idly by as many of your reads are hopelessly discarded by your favourite aligner
  • Call for variants on what is left of your aligned reads by assuming any heterogeneous site is a variant
  • Feed the aligned reads and called variant positions to Gretel
  • Compare the recovered haplotypes, to the input haplotypes, with BLAST

As described in our paper, results for both the DHFR and AIMP1 data sets were promising, with up to 99% of SNPs correctly recovered to successfully reconstruct the input haplotypes. We have shown that haplotypes can be recovered from short reads that originate from a metagenomic data set. However, I want to revisit this analysis to improve the story and fix a few outstanding issues:

  • A reviewer assumed lower recovery rates on haplotypes with poorer identity to the pseudo-reference was purely down to bad decision making by Gretel, and not to do with the long paragraph about how this is caused by discarded reads
  • A reviewer described the tabular results as hard to follow
  • I felt I had not provided a deep enough analysis to readers into the effects of read length and coverage, although no reviewer asked for this, it will help me sleep at night
  • Along the same theme, the tables in the pre-print only provided averages, whereas I felt a diagram might better explain some of the variance observed in recovery rates
  • Additionally, I’d updated the read generator as part of my work on the “Treeviomes” discussed in my last post, particularly to generate more realistic looking distributions of reads, I needed to check results were still reliable and thought it would be helpful if we had a consistent story between our data sets


Continuing the rampant pessimism that can be found in the first half of this status report, I wanted to audit my work on these “synthetic metahaplomes”, particularly given the changed results between the Triviomes, and more biologically realistic and tangible Treeviomes. I wanted to be sure that we had not departed from the good results introduced in the DHFR and AIMP1 sections of the paper.

To be a little less gloomy, I was also inspired by the progress I had made on the analysis of the Treeviomes. I felt we could we could try and uniformly present our results and use box diagrams similar to those that I had finally managed to design and plot at the end of my last post. I feel these diagrams provide a much more detailed insight to the capabilities of Gretel given attributes of read sets such as length and coverage. Additionally, we know each of the input haplotypes, which unlike the uniformly varying sequences of our Treeviomes, have decreasing similarity to the pseudo-reference; thus we can use such a plot to describe Gretel‘s accuracy with respect to each haplotype’s distance from the pseudo-reference (somewhat akin to how we present results for different values of per-haplotype per-base sequence mutation rates on our Treeviomes).

So here’s one I made earlier:

Sam, what the fuck am I looking at?

  • Vertical facets are read length
  • Horizontal facets represent each of the five input haplotypes, ordered by decreasing similarity to the chosen DHFR pseudo-reference
  • X-axis of each boxplot is the average read coverage for each of the ten generated read sets
  • Y-axis is the average best recovery rate for a given haplotype, over the ten runs of Gretel: each boxplot summarising the best recoveries for ten randomly generated uniform distributed sets of reads (with a set read length and coverage)
  • Top scatter plot shows number of called variants across the ten sets of reads for each length-coverage parameter pair
  • Bottom scatter plot shows for each of the ten read sets, the proportion of reads from each haplotype that were dropped during alignment

What can we see?

  • Everything is not awful: three of the five haplotypes can be recovered to around 95% accuracy, even with very short reads (75-100bp), and reasonable average per-haplotype coverage.
  • Good recovery of AK232978, despite a 10% mutation rate when compared to the pseudo-reference: Gretel yielded haplotypes with 80% accuracy.
  • Recovery of XM_012113510 is particularly difficult. Even given long reads (150-250bp) and high coverage (50x) we fail to reach 75% accuracy. Our pre-print hypothesiszed that this was due to its 82.9% identity to the pseudo-reference causing dropped reads.

Our pre-print presents results considering just one read set for each of the length and coverage parameter pairs. Despite the need to perhaps take those results with a pinch of salt, it is somewhat relieving to see that our new more robust experiment yields very similar recovery rates across each of the haplotypes. With the exception of XM_012113510.

The Fall of “XM”

Recovery rates for our synthetic metahaplomes, such as DHFR, are found by BLASTing the haplotypes returned by Gretel against the five known, input haplotypes. The “best haplotype” for an input is defined as the output haplotype with the fewest mismatches against the given input. We report the recovery rate by assuming the mismatches are the incorrectly recovered variants, and divide this by the number of variants called over the read set that yielded this haplotype.

We cannot simply use sequence identity as a score: consider a sequence of 100bp, if there are just 5 variants, all of which are incorrectly reconstructed, Gretel will still appear to be 95% accurate, rather than correctly reporting 0% (1 – 5/5). Note however, that our method does somewhat assume that variants are called perfectly: remember that homogenous sites are ignored by Gretel, so any “missed” SNPs, will always mismatch the input haplotypes, as we “fill in” homogeneous sites when outputting the output haplotype FASTA, with the nucleotides from the pseudo-reference.

XM_012113510 posed an interesting problem for evaluation. When collating results for the pre-print, I found the corresponding best haplotype failed to align the entirity of the gene (564bp) and instead yielded two high scoring, non-overlapping hits. When I updated the harness that calculates and collates recovery rates across all the different runs of Gretel, I overlooked the possibility for this and forced selection of the “best” hit (-max_hsps 1). Apparently, this split-hit behaviour was not a one-off, but almost universal across all runs of Gretel over the 640 sets of reads.

In fact, the split-hit was highly precise. Almost every best XM_012113510 haplotype had a hit along the first 87bp, and another starting at (or very close to) 135bp, stretching across the remainder of the gene. My curiosity piqued, I fell down another rabbit hole.

Curious Coverage

I immediately assumed this must be an interesting side-effect of read dropping. I indicated in our pre-print that we had difficulty aligning reads against a pseudo-reference where those reads originate from a sequence that is dissimilar (<= 90% identity) from the reference. This read dropping phenomenon has been one of the many issues our lab has experiened with current bioinformatics tooling when dealing with metagenomic data sets.

Indeed, our boxplot above features a colour-coded scatter plot that demonstrates that the two most dissimilar input haplotypes AK and XM, are also the haplotypes that experience the most trouble during read alignment. I suggested in the pre-print that thes dropped reads are the underlying reason for Gretel‘s performance on those more dissimilar haplotypes.

I wanted to see empirically whether this was the case, and to potentially find a nice way of providing evidence, given that one of our reviewers missed my attribution of these poorer AK and XM recoveries to trouble with our alignment step, rather than incompetence on Gretel‘s part. I knocked up a small Python script that deciphered my automatically generated read names, and calculated per-haplotype read coverage, for each of the five input haplotypes, across the 640 alignments.

  • X-axis represents genomic position along the psuedo-reference (the “master” `DHFR gene)
  • Y-axis labels each of the input haplotype, starting with the most dissimilar (XM) and ending on the most similar (BC)
  • The coloured heatmap plots the average coverage for a given haplotype at a given position, across all generated BAM files

Well look what we have here… Patches of XM and AK are most definitely poorly supported by reads! This is troubling for their recovery, as Hansel stores pairwise evidence of variants co-occurring on the same read. If those reads are discarded, we have no evidence of that haplotype. It should be no surprise that Gretel has difficultly here. We can’t create evidence from nothing.

What’s this?… A crisp, ungradiated, dark navy box representing absolutely 0 reads across any of the 640 alignments, sitting on our XM gene. If I had to guesstimate the bounds of that box on the X-axis, I would bet my stipend that they’ll be a stonesthrow from 87 and 135bp… The bounds of the split BLAST hits we were reliably generating against XM_012113510, almost 640 times. I delved deeper. What did the XM reads over that region look like? Why didn’t they align correctly? Where they misaligned, or dropped entirely?

But I couldn’t find any.

Having a blast with BLAST

After much head scratching. I began an audit from the beginning, consulting the BLAST record that led me to select XM_012113510.1 for our study:

Seems legit? An approximately 83% hit with full query coverage (564bp) consisting of 46 mismatches and containing an indel of length one? Not quite, as I would later find. But to cut an already long story somewhat shorter. The problem boils down to me being unable to read, again.

  • BLAST outfmt6 alignment lengths include deletions on the subject
  • BLAST outfmt6 tells you the number of gaps, not their size

For fuck’s sake, Sam. Sure enough, here’s the full alignment record. Complete with a fucking gigantic 50bp deletion on the subject, XM_012113510.

It is no wonder we couldn’t recover 100% of the XM_012113510 haplotype, when compared to our pseudo-reference, 10% of it doesn’t fucking exist. Yet, it is interesting to see that the best recovered XM_012113510‘s were identified by BLAST to be very good hits to the XM_012113510 that actually exists in nature, despite my spurious 50bp insertion. Although Gretel is still biased by the structure of the pseudo-reference, (which is one of the reasons that insertions and deletions are still a pain), we are still able to make accurate recoveries around straightforward indel sites like this.

As lovely as it is to have gotten to the bottom of the XM_012113510 recovery troubles, this doesn’t really help our manuscript. We want to avoid complicated discussions and workarounds that may confuse or bore the reader. We already had trouble with our reviewers misunderstanding the purpose of the convoluted Triviome method and its evaluation. I don’t want to have to explain why recovery rates for XM_012113510 need special consideration because of this large indel.

I decided the best option was to find a new member of the band.

DHFR: Reunion Tour

As is the case whenever I am certain something won’t take the entire day, this was not as easy as I anticipated. Whilst it was trivial to BLAST my chosen pseudo-reference once more and find a suitable replacement (reading beyond “query cover” alone this time), trouble arose once more in the form of reads dropped via alignment.

Determined that these sequences should (and must) align, I dedicated the day to finally find some bowtie2 parameters that would more permissively align sequences to my pseudo-reference to ensure Gretel had the necessary evidence to recover even the more dissimilar sequences:


So, what’s next?

Wonderful. So after your productive day of heatmap generation, what are you going to do with your life new found alignment powers?

✓ Select a new DHFR line-up

The door has been blown wide open here. Previously our input test haplotypes had a lower bound of around 90% sequence similarity to avoid the loss of too many reads. However with my definition that I have termed --super-sensitive-local, we can attempt recoveries of genes that are even more dissimilar from the reference! I’ve actually already made the selection, electing to keep BC (99.8%), XR (97.3%) and AK (90.2%). I’ve remove KJ (93.2%) and the troublemaking XM (85.1%) to make way for the excitingly dissimilar M (83.5%) and (a different) XM (78.7%).

✓ Make a pretty picture

After some chitin and Sun Grid Engine wrangling, I present this:

✓ Stare in disbelief

Holy crap. It works.


  • Find what is inevitably wrong with these surprisingly excellent results
  • Get this fucking paper done
  • Merry Christmas


  • In a panic I went back to check my DHFR analysis and then despite it being OK, changed everything anyway
  • One of the genes from our paper is quite hard to recover, because 10% of it is fucking missing
  • Indels continue to be a pain in my arse
  • Contrary to what I said last week, not everything is awful
  • Coverage really is integral to Gretel‘s ability to make good recoveries
  • I’ve constructed some parameters that allow us to permissively align metagenomic reads to our pseudo-reference and shown it to massively improve Gretel‘s accuracy
  • bowtie2 needs some love <3
  • I probably care far too much about integrity of data and results and should just write my paper already
  • A lot of my problems boil down to not being able to read
  • I owe Hadley Wickham a drink to making it so easy to make my graphs look so pretty
  • Everyone agrees this should be my last Christmas as a PhD
  • Tom is mad because I keep telling him about bugs and pitfalls I have find in bioinformatics tools