bowtie2: Relaxed Parameters for Generous Alignments to Metagenomes
https://samnicholls.net/2016/12/24/bowtie2-metagenomes/
Sat, 24 Dec 2016 00:34:46 +0000

In a change to my usual essay-length posts, I wanted to share a quick bowtie2 tip for relaxing the parameters of alignment. It's no big secret that bowtie2 has these options, and there's some pretty good guidance in the manual, too. However, we've had significant trouble in our lab finding a suitable set of permissive alignment parameters.

In the course of my PhD work on haplotyping regions of metagenomes, I have found that, even with bowtie2's somewhat permissive --very-sensitive-local preset, sequences with less than 90% identity to the reference are significantly less likely to align back to that reference. This is problematic in my line of work, where I wish to recover all of the individual variants of a gene: the basis of my approach relies on a set of short reads (50-250bp) aligned to a position on a metagenomic assembly (which I term the pseudo-reference). It's important to note that I am not interested in the assembly of individual genomes from metagenomic reads, but in the genes themselves.

Recently, the opportunity arose to provide some evidence for this. I have some datasets that constitute "synthetic metahaplomes", each consisting of a handful of arbitrary known genes that all perform the same function, each from a different organism. These genes can be broken up into synthetic reads and aligned to some common reference (another gene in the same family).

This alignment can be used as a means to test my metagenomic haplotyper, Gretel (and her novel brother data structure, Hansel), by attempting to recover the original input sequences from these synthetic reads. I've already reported in my pre-print that our method is at the mercy of the preceding alignment, and used this as the hypothesis for a poor recovery in one of our data sets.

Indeed, as part of my latest experiments, I have generated some coverage heat maps showing the average coverage of each haplotype (Y-axis) at each position of the pseudo-reference (X-axis), and I've found that for sequences whose identity falls below roughly 90%, --very-sensitive-local becomes unsuitable.

The BLAST record below represents the alignment corresponding to the gene whose reads go on to align at the average coverage depicted in the top bar of the above heatmap. Despite its 79% identity, it looks good(TM) to me, and I need sequences at this level of diversity to align to my pseudo-reference so they can be included in Gretel's analysis. I need generous alignment parameters to permit even quite diverse reads (but hopefully not so diverse that they no longer represent a gene of the same family) to map back to my reference. Otherwise Gretel will simply miss these haplotypes.

So, despite having already spent many days of my PhD repeatedly failing to increase my overall alignment rates for my metagenomes, I felt this time it would be different. I had a method (my heatmap) to see how my alignment parameters affected the alignment rates of reads on a per-haplotype basis. It had also taken until now for me to quantify just what sort of sequences we were missing out on, courtesy of dropped reads.

I was determined to get this right.

For a change, I’ll save you the anticipation and tell you what I settled on after about 36 hours of getting cross.

  • --local -D 20 -R 3
    Ensure we’re not performing end-to-end alignment (allow for soft clipping and the like), and borrow the most sensitive default “effort” parameters.
  • -L 3
    The seed substring length. Decreasing this from the default (20-25, depending on the preset) to just 3 allows for a much more aggressive alignment, but adds computational cost. I actually had reasonably good results with -L 11, which might suit you if you have a much larger data set but still need to relax the aligner.
  • -N 1
    Permit a mismatch in the seed, because why not?
  • --gbar 1
    Disallows gaps within 1 position of either end of a read (the bowtie2 default is 4). This has a small but noticeable effect: it appears to thin the width of some of the coverage gaps in the heatmap at the most stubborn sites.
  • --mp 4
    Reduces the maximum penalty that can be applied to a strongly supported (high quality) mismatch by a third (from the default of 6 to 4). The aggregate sum of these penalties is what causes reads to be dropped. Along with the seed substring length, this had a significant influence on increasing my alignment rates. If your coverage gaps are stubborn, you could decrease this again.

Tada.
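For context, here is roughly how the whole alignment step looks when strung together, sketched as a small Python wrapper. The reference, read file and output names are placeholders, and the samtools steps are just the usual post-processing; only the bowtie2 flags are the point of this post.

    import subprocess

    # Placeholder inputs: a metagenomic assembly used as the pseudo-reference,
    # and the short reads we want to coax into aligning against it.
    REFERENCE = "pseudo_reference.fa"    # hypothetical file name
    READS = "reads.fq"                   # hypothetical file name
    INDEX = "pseudo_reference_idx"       # bowtie2 index prefix

    # Build the bowtie2 index for the pseudo-reference
    subprocess.run(["bowtie2-build", REFERENCE, INDEX], check=True)

    # Align with the relaxed, generous parameters discussed above
    subprocess.run(
        ["bowtie2", "--local", "-D", "20", "-R", "3", "-L", "3", "-N", "1",
         "--gbar", "1", "--mp", "4", "-p", "8",
         "-x", INDEX, "-U", READS, "-S", "relaxed.sam"],
        check=True,
    )

    # Sort and index the result as usual, ready for downstream tools
    subprocess.run(["samtools", "sort", "-o", "relaxed.bam", "relaxed.sam"], check=True)
    subprocess.run(["samtools", "index", "relaxed.bam"], check=True)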


tl;dr

  • bowtie2 --local -D 20 -R 3 -L 3 -N 1 -p 8 --gbar 1 --mp 3
Status Report: November 2016 (Part I): Triviomes, Treeviomes & Fuck Everything
https://samnicholls.net/2016/12/19/status-nov16-p1/
Mon, 19 Dec 2016 23:14:33 +0000

Once again, I have adequately confounded progress since my last report, to both myself and my supervisorial team, such that it must be outlaid here. Since I got back from having a lovely time away from bioinformatics, the focus has been to build on top of our highly shared but unfortunately rejected pre-print: Advances in the recovery of haplotypes from the metagenome.

I'd hoped to have a new-and-improved draft ready by Christmas, in time for an invited talk at Oxford, but sadly I've had to postpone both. Admittedly, it has taken quite some time for me to dust myself down after having the entire premise of my PhD so far rejected without re-submission, but I have finally built up the motivation to revisit what is quite a mammoth piece of work, and am hopeful that I can take some of the feedback on board to ring in the new year with an even better paper.

This will likely be the final update of the year.
This is also the last Christmas I hope to be a PhD candidate.

Friends and family can skip to the tl;dr

The adventure continues…

We left off with a lengthy introduction to my novel data structure, Hansel, and algorithm, Gretel. In that post I briefly described some of the core concepts of my approach, such as how the Hansel matrix is reweighted after Gretel successfully creates a path (haplotype), how we automatically select a suitable value for the "lookback" parameter (i.e. the order of the Markov chain used when calculating probabilities for the next variant of a haplotype), and the current strategy for smoothing.

In particular, I described our current testing methodologies. In the absence of metagenomic data sets with known haplotypes, I improvised two strategies:

  • Trivial Haplomes (Triviomes)
    Data sets designed to be finely controlled and well-defined. Short, random haplotypes and sets of reads are generated. We also generate the alignment and variant calls automatically, to eliminate noise arising from the biases of external tools. These data sets are not expected to be indicative of performance on actual sequence data, but rather represent a platform on which we can test some of the limitations of the approach.

  • Synthetic Metahaplomes
    Designed to be more representative of the problem, we generate synthetic reads from a set of similar genes. The goal is to recover the known input genes, from an alignment of their reads against a pseudo-reference.

I felt our reviewers misunderstood both the purpose and results of the "triviomes". In retrospect, this was probably due to the (albeit intentional) lack of any biological grounding distracting readers from the story at hand. The trivial haplotypes were randomly generated, such that none of them had any shared phylogeny. Every position across those haplotypes was deemed a SNP, and many positions were tetra-allelic. The idea behind this was to cut out the intermediate stage of needing to remove homogeneous positions across the haplotypes (or indeed, of even having to generate haplotypes with homogeneous positions). Generated reads were thus seemingly unrealistic, at a length of 3-5bp. However, they were meant to represent not a 3-5bp piece of sequence, but the 3-5bp of sequence that remains when one considers only the genomic positions with variation; i.e. our reads were simulated such that they spanned between 3 and 5 SNPs of our generated haplotypes.

I believe these confusing properties and their justifications got in the way of expressing their purpose, which was not to emulate the real metahaplotyping problem, but to introduce some of the concepts and limitations of our approach in a controlled environment.
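To make that concrete, here is a minimal sketch of the kind of generator I mean. It is not the actual triviome script (all names and numbers are arbitrary); it just illustrates the idea that every position of every haplotype is a variant, and a "read" is simply a window spanning 3-5 of those variant positions.

    import random

    def generate_triviome(n_haplotypes=5, n_snps=25, seed=42):
        """Generate random haplotypes in which every position is a SNP
        (and may well be tetra-allelic across the set)."""
        rng = random.Random(seed)
        return ["".join(rng.choice("ACGT") for _ in range(n_snps))
                for _ in range(n_haplotypes)]

    def shred(haplotype, min_span=3, max_span=5, seed=42):
        """Shred one haplotype into overlapping 'reads', each spanning 3-5 SNPs
        (truncated at the end of the haplotype)."""
        rng = random.Random(seed)
        reads = []
        for start in range(len(haplotype) - min_span + 1):
            span = rng.randint(min_span, max_span)
            reads.append((start, haplotype[start:start + span]))
        return reads

    haplotypes = generate_triviome()
    reads = [read for hap in haplotypes for read in shred(hap)]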

Additionally, our reviewers argued that the paper was lacking an extension to the evaluation of synthetic metahaplomes: data sets that contain real sequencing reads. Indeed, I felt that this was probably the largest weakness of my own paper, especially as addressing it would not require an annotated metagenome. Though I had purposefully stayed on the periphery of simulating a "proper" metagenome, as there are ongoing arguments in the literature as to the correct methodology and I wanted to avoid the simulation itself being used against our work. That said, it would be prudent to at least present small synthetic metahaplomes akin to DHFR and AIMP1, using real reads.

So this leaves us with a few major plot points to work on before I can peddle the paper elsewhere:

  • Improve Triviomes
    We are already doing something interesting and novel, but the “triviomes” are evidently convoluting the explanation. We need something with more biological grounding such that we don’t need to spend many paragraphs explaining why we’ve made certain simplifications, or cause readers to question why we are doing things in a particular way. Note this new method will still need to give us a controlled environment to test the limitations of Hansel and Gretel.
  • Polish DHFR and AIMP1 analysis
    One of our reviewers misinterpreted some of the results, and drew a negative conclusion about Gretel's overall accuracy. I'd like to revisit the DHFR and AIMP1 data sets both to improve the story we tell, and to describe in more detail (with more experiments) under what conditions we can and cannot recover haplotypes accurately.
  • Real Reads
    Create and analyse a data set consisting of real reads.

The remainder of this post will focus on the first point, because otherwise no-one will read it.


Triviomes and Treeviomes

After a discussion about how my Triviomes did not pay off (where I believe I likened them to "random garbage"), it was clear that we needed a different tactic to introduce this work. Ideally this would be something simple enough that we could still have total control over both the metahaplome to be recovered and the reads to recover it from, but that would also yield a simpler explanation for our readers.

My biology-sided supervisor, Chris, is an evolutionary biologist with a fetish for trees. Throughout my PhD so far, I have managed to steer away from phylogenetic trees and the like, especially after my terrifying first year foray into taxonomy, where I discovered that not only can nobody agree on what anything is, or where it should go, but there are many ways to skin a cat draw a tree.

Previously, I presented the aggregated recovery rates of randomly generated metahaplomes for a series of experiments in which I varied the number of haplotypes and their length. Remember that every position of these generated haplotypes was a variant. Thus, one may argue that the length of these random haplotypes was a poor proxy for genetic diversity. That is, we increased the number of variants (by making longer haplotypes) to artificially increase the level of diversity in the random metahaplome and make recoveries more difficult. Chris pointed out that, actually, we could specify and fix the level of diversity, and generate our haplotypes according to some… tree.

This seemed like an annoyingly neat and tidy solution to my problem. Biologically speaking, this is a much easier explanation for readers: our sequences will have meaning, our reads will look somewhat more realistic and, most importantly, the recovery goal is all the more tangible. Yet at the same time, we still have precise control over the tree, and we can generate the synthetic reads in exactly the same way as before, allowing us to maintain tight control of their attributes. So, despite my aversion to anything that remotely resembles a dendrogram, on this occasion I have yielded. I introduce the evaluation strategy to supplant[1] my Triviomes: Treeviomes.

(Brief) Methodology

  • Heartlessly throw the Triviomes section in the bin
  • Generate a random start DNA sequence
  • Generate a Newick format tree. The tree is a representation of the metahaplome that we will attempt to recover. Each branch (taxon) of the tree corresponds to a haplotype. The shape of the tree will be a star, with each branch of uniform length. Thus, the tree depicts a number of equally diverse taxa from a shared origin
  • Use the tree to simulate evolution of the start DNA sequence to create the haplotypes that comprise the synthetic metahaplome
  • As before, generate reads (of a given length, at some level of coverage) from each haplotype, and automatically generate the alignment (we know where our generated reads should start and end on the reference without external tools) and variant calls (any heterogeneous genomic position when the reads are piled up)
  • Rinse and repeat, make pretty pictures

The foundation for this part of the work is set. Chris even recommended seq-gen as a tool that can simulate evolution from a starting DNA sequence, following a Newick tree, which I am using to generate our haplotypes. So I now have a push-button-to-metahaplome workflow that generates the necessary tree, haplotypes, and reads for testing Gretel.
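The only part of that workflow simple enough to show inline is the star-shaped Newick tree itself. This is just a sketch with arbitrary taxon labels; the resulting string is what gets handed to seq-gen along with the random start sequence (how the branch length maps onto the per-base mutation rate depends on the substitution model you give seq-gen).

    def star_newick(n_taxa, branch_length):
        """Build a Newick string for a star tree: n_taxa equally diverse
        branches radiating from a single shared origin."""
        labels = [chr(ord("A") + i) for i in range(n_taxa)]          # A, B, C, ...
        branches = ",".join("%s:%.4f" % (label, branch_length) for label in labels)
        return "(%s);" % branches

    # e.g. a 3-taxa tree with each haplotype at 1% divergence from the origin
    print(star_newick(3, 0.01))   # (A:0.0100,B:0.0100,C:0.0100);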

I’ve had two main difficulties with Treeviomes…

• Throughput

Once again, running anything thousands of times has proven the bane of my life. Despite having a well-defined workflow to generate and test a metahaplome, getting the various tools and scripts to work on the cluster here has been a complete pain in my arse. So much so that I ended up generating all of the data on my laptop (sequentially, over the course of a few days) and merely uploading the final BAMs and VCFs to our compute cluster to run Gretel. This has been pretty frustrating, especially when last weekend I set my laptop to work on creating a few thousand synthetic metahaplomes and promised some friends that I'd take the weekend off work for a change, only to find on Monday that my laptop had done exactly the same.

• Analysis

Rather unexpectedly, initial results raised more questions than answers. This was pretty unwelcome news following the faff involved in just generating and testing the many metahaplomes. Once Gretel's recoveries were finished (the smoothest part of the operation, which was a surprise in itself, given the presence of Sun Grid Engine), another disgusting munging script of my own doing spat out the convoluted plot below:

The figure is a matrix of boxplots where:

  • Horizontal facets are the number of taxa in the tree (i.e. haplotypes)
  • Vertical facets are per-haplotype, per-base mutation rates (i.e. the probability that any genomic position on any of the taxa may be mutated from the common origin sequence)
  • X-axis of each boxplot represents each haplotype in the metahaplome, labelled A – O
  • Y-axis of each boxplot quantifies the average best recovery rate made by Gretel for a given haplotype A – O, over ten executions of Gretel (each using a different randomly generated, uniformly distributed read set of 150bp at 7x per-haplotype coverage)

We could make a few wild speculations, but no concrete conclusions:

  • At low diversity, it may be impossible to recover haplotypes, especially for metahaplomes containing fewer haplotypes
  • Increasing diversity appears to create more variance in accuracy; mean accuracy increases slightly in datasets with 3-5 haplotypes, but falls with 10+
  • Increasing the number of haplotypes in the metahaplome appears to increase recovery accuracy
  • In general, whilst there is variation, recovery rates across haplotypes are fairly clustered
  • It is possible to achieve 100% accuracy for some haplotypes under high diversity, and few true haplotypes

The data is not substantial on the surface. But, if anything, I seemed to have refuted my own pre-print. Counter-intuitively, we now seem to have shown that the problem is easier in the presence of more haplotypes and more variation. I was particularly disappointed with the ~80% accuracy rates at mid-level diversity on just 3 haplotypes. Overall, recovery accuracy appeared worse than that of my less realistic Triviomes.

This made me sad, but mostly cross.

The beginning of the end of my sanity

I despaired at the apparent loss of accuracy. Where had my over-90% recoveries gone? I could feel my PhD pouring away through my fingers like sand. What changed here? Indeed, I had altered the way I generated reads since the pre-print: was it the new read shredder? Or are we just less good at recovering from more realistic metahaplomes? With the astute assumption that everything I am working on equates to garbage, I decided to miserably withdraw from my PhD for a few days to play Eve Online…

I enjoyed my experiences of space. I began to wonder whether I should quit my PhD and become an astronaut, shortly before my multi-million ISK ship was obliterated by pirates. I lamented my inability to enjoy games that lack copious micromanagement, before accepting that I am destined to be grumpy in all universes and that perhaps for now I should be grumpy in the one where I have a PhD to write.

In retrospect, I figure that perhaps the results in my pre-print and the ones in our new megaboxplot were not in disagreement, but rather incomparable in the first place. Whilst an inconclusive conclusion on that front would not answer any of the other questions introduced by the boxplots, it would at least make me feel a bit better.

Scattering recovery rates by variant count

So I constructed a scatter plot to show the relationship between the number of called variants (i.e. SNPs) and the best Gretel recovery rate for each haplotype of all the tested metahaplomes (dots, coloured by coverage level below), against the overall best average recovery rates from my pre-print (large black dots).

Immediately, it is obvious that we are discussing a difference in magnitude when it comes to numbers of called variants, particularly when base mutation rates are high. But if we are still looking for excuses, we can consider the additional caveats:

  • Read coverage from the paper is 3-5x per haplotype, whereas our new data set uses a fixed coverage of 7x
  • The number of variants on the original data sets (black dots) are guaranteed, and bounded, by their length (250bp max)
  • Haplotypes from the paper were generated randomly, with equal probabilities for nucleotide selection. We can consider this as a 3 in 4 chance of disagreeing with the pseudo-reference: a 0.75 base mutation rate. The most equivalent subset of our new data consists of metahaplomes with a base mutation rate of "just" 0.25.

Perhaps the most pertinent point here is the last. Without an insane 0.75 mutation rate dataset, it really is quite sketchy to debate how recovery rates of these two data sets should be compared. This said, from the graph we can see:

  • Those 90+% average recoveries I’m missing so badly belong to a very small subset of the original data, with very few SNPs (10-25)
  • There are still recovery rates stretching toward 100%, particularly for the 3 haplotype data set, but only at base mutation rates of 2.5% and above
  • Actually, recovery rates are not so sad overall, considering the significant number of SNPs, particularly for the 5 and 10 haplotype metahaplomes

Recoveries are high for unrealistic variation

Given that a variation rate of 0.75 is incomparable, what's a sensible amount of variation to concern ourselves with anyway? I ran the numbers on my DHFR and AIMP1 data sets, dividing the number of called variants on my contigs by their total length. Naively distributing the number of SNPs across each haplotype evenly, I found the magic number representing per-haplotype, per-base variation to be around 1.5% (0.015). Of course, that isn't exactly a rigorous analysis, but perhaps it points us in the right direction, if not the correct order of magnitude.
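That back-of-envelope calculation is nothing more complicated than the sketch below. The numbers are placeholders for illustration, not the real DHFR or AIMP1 figures.

    # Hypothetical numbers, for illustration only
    n_called_snps = 450     # SNPs called across the contig's pileup
    contig_length = 3000    # length of the contig (bp)
    n_haplotypes = 10       # haplotypes assumed to share those SNPs evenly

    # Naively share the SNPs between the haplotypes, then express as a per-base rate
    per_haplotype_snps = n_called_snps / n_haplotypes
    per_base_variation = per_haplotype_snps / contig_length
    print("%.3f (%.1f%%)" % (per_base_variation, per_base_variation * 100))  # 0.015 (1.5%)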

So the jig is up? We report high recovery rates for unnecessarily high variation rates (>2.5%), but our current data sets don’t seem to support the idea that Gretel needs to be capable of recovering from metahaplomes demonstrating that much variation. This is bad news, as conversely, both our megaboxplot and scatter plot show that for rates of 0.5%, Gretel recoveries were not possible in either of the 3 or 5 taxa metahaplomes. Additionally at a level of 1% (0.01), recovery success was mixed in our 3 taxa datasets. Even at the magic 1.5%, for both the 3 and 5 taxa, average recoveries sit uninterestingly between 75% and 87.5%.

Confounding variables are the true source of misery

Even with the feeling that my PhD is going through rapid unplanned disassembly with me still inside of it, I cannot shake off the curious result that increasing the number of taxa in the tree appears to improve recovery accuracy. Each faceted column of the megaboxplot shares elements of the same tree. That is, the 3 taxa 0.01 (or 1%) diversity rate tree is a subtree of the 15 taxa 0.01 diversity tree. The haplotypes A, B and C are shared. Yet why does the only reliable way to improve results among those haplotypes seem to be the addition of more haplotypes? In fact, why are the recovery rates of all the 10+ metahaplomes so good, even under per-base variation of half a percent?

We’ve found the trap door, and it is confounding.

Look again at the pretty scatter plot. Notice how the number of called variants increases as we increase the number of haplotypes, for the same level of variation. Notice also that the same A, B and C haplotypes that could not be recovered from the 3-taxa trees at low diversity can actually be recovered when there are 10 or 15 taxa present.

Recall that each branch of our tree is weighted by the same diversity rate. Thus, when aligned to a pseudo-reference, synthetic reads generated from metahaplomes with more original haplotypes have a much higher per-position probability of containing at least one disagreeing nucleotide in a pileup; i.e. the number of variants is a function of the number of original haplotypes, not just their diversity.

The confounding factor is the influence of Gretel's lookback parameter, L. We automatically set the order of the Markov chain used to determine the next nucleotide variant (given the last L selected variants) to be equal to the average number of SNPs spanned by all valid reads that populated the Hansel structure. A higher number of called variants in a dataset offers not only more pairwise evidence for Hansel and Gretel to consider (as there are more pairs of SNPs), but also a higher order Markov chain (as there are more pairs of SNPs on the same read). Thus, with more SNPs, the hypothesis is that Gretel has at her disposal runs of L variants that are not only longer, but more unique to the haplotype that must be recovered.
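To make the confounding explicit, here is a simplified stand-in for that selection rule (the names are invented and this is not Gretel's actual implementation): L is just the mean number of called SNP positions spanned per read, so anything that packs more SNPs onto the same reads pushes L, and therefore the order of the Markov chain, upwards.

    def choose_lookback(reads, snp_positions):
        """Set the Markov chain order L to the average number of called SNP
        positions spanned by each read that covers at least one SNP."""
        spans = []
        for start, end in reads:                          # reads as (start, end) intervals
            spanned = sum(1 for pos in snp_positions if start <= pos <= end)
            if spanned:
                spans.append(spanned)
        return max(1, round(sum(spans) / len(spans))) if spans else 1

    reads = [(i, i + 150) for i in range(0, 850, 50)]     # 150bp reads, uniformly placed
    sparse_snps = [100, 450, 800]                         # few variants (low diversity, few taxa)
    dense_snps = list(range(50, 1000, 25))                # many variants (more taxa or diversity)
    print(choose_lookback(reads, sparse_snps), choose_lookback(reads, dense_snps))  # 1 vs ~7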

It seems my counter-intuitive result of more variants and more haplotypes making the problem easier has the potential to be true.

This theory explains the converse problem of being unable to recover any haplotypes from 3 and 5-taxa trees at low diversity. There simply aren’t enough variants to inform Gretel. After all, at a rate of 0.5%, one would expect a mere 5 variants per 1000bp. Our scatter plot shows for our 3000bp pseudo-reference, at the 0.5% level we observe fewer than 50 SNPs total, across the haplotypes of our 3-taxa tree. Our 150bp reads are not long enough to span the gaps between variants, and Gretel cannot make decisions on how to cross these gaps.

This doesn't necessarily mean everything is not terrible, but it certainly means the megaboxplot is not only an awful way to demonstrate our results, but probably a poorly designed experiment too. We currently confound the average number of SNPs on reads by observing just the number of haplotypes and their diversity. To add insult to statistical injury, we then plot them in facets that imply they can be fairly compared. Yet increasing the number of haplotypes increases the number of variants, which increases the density of SNPs on reads, and improves Gretel's performance: we cannot compare the 3-taxa and 15-taxa trees of the same diversity in this way, as the 15-taxa tree has an unfair advantage.

I debated with my resident PhD tree pervert about this. In particular, I suggested that perhaps the diversity could be split equally between the branches, such that synthetic read sets from a 3-taxa tree and a 15-taxa tree should expect to have the same number of called variants, even if the individual haplotypes themselves have a different level of variation between the trees. Chris argued that whilst that would fix the problem and make the trees more comparable, going against the grain of simple biological explanations would reintroduce the boilerplate explanation bloat to the paper that we were trying to avoid in the first place.

Around this time I decided to say fuck everything, gave up and wrote a shell for a little while.

Deconfounding the megabox

So where are we now? Firstly, I agreed with Chris. I think splitting the diversity between haplotypes, whilst yielding datasets that might be more readily comparable, will just make for more difficult explanations in our paper. But fundamentally, I don't think these comparisons actually help us to tell the story of Hansel and Gretel. Thinking about it afterwards, there are other nasty, unobserved variables in our megaboxplot experiment that directly affect the density of variants on reads, namely read length and read coverage. We had fixed these to 150bp and 7x coverage for the purpose of our analysis, which felt like a dirty trick.

At this point, bioinformatics was starting to feel like a grand conspiracy, and I was in on it. Would it even be possible to fairly test and describe how our algorithm works through the noise of all of these confounding factors?

I envisaged the most honest method to describe the efficacy of my approach as a sort of lookup table. I want our prospective users to be able to determine what sort of haplotype recovery rates might be possible from their metagenome, given a few known attributes, such as read length and coverage at their region of interest. I also feel obligated to show under what circumstances Gretel performs less well, and offer reasoning for why. But ultimately, I want readers to know that this stuff is really fucking hard.

Introducing the all new low-fat less-garbage megaboxplot

Here is where I am right now. I took this lookup idea, ran a new experiment consisting of some 1500 read sets and runs of Gretel, and threw the results together to make this:

  • Horizontal facets represent synthetic read length
  • Vertical facets are (again) per-haplotype, per-base mutation rates, this time expressed as a percentage (so a rate of 0.01 is now 1%)
  • Colour-coded X-axis of each boxplot depicts the average per-haplotype read coverage
  • Y-axis of each boxplot quantifies the average best recovery rate made by Gretel for all of the five haplotypes, over ten executions of Gretel (each using a different randomly generated, uniformly distributed read set)

I feel this graph is much more tangible to users and readers. I feel much more comfortable expressing our recovery rates in this format, and I hope eventually our reviewers and real users will agree. Immediately we can see this figure reinforces some expectations: primarily, that increasing the read length and/or coverage yields a large improvement in Gretel's performance. Increasing read length also lowers the coverage required for accuracy.

This seems like a reasonable proof of concept, so what’s next?

  • Generate a significant amount more input data, preferably in a way that doesn’t make me feel ill or depressed
  • Battle with the cluster to execute more experiments
  • Generate many more pretty graphs

I'd like to run this test for metahaplomes with a different number of taxa, just to satisfy my curiosity. I also want to investigate the 1-2% diversity region in a more fine-grained fashion. Particularly important will be to repeat the experiments with multiple metahaplomes for each read length, coverage and sequence diversity parameter triplet, to randomise away the influence of the tree itself. I'm confident this is the reason for inconsistencies in the latest plot, such as the 1.5% diversity tree with 100bp reads yielding no results (likely due to this particular tree generating haplotypes such that piled-up reads contain a pair of variants more than 100bp apart).


Conclusion

  • Generate more fucking metahaplomes
  • Get this fucking paper out

tl;dr

  • I don’t want to be doing this PhD thing in a year’s time
  • I’ve finally started looking again at our glorious rejected pre-print
  • The trivial haplomes tanked; they were too hard to explain to reviewers and actually don't provide that much context on Gretel anyway
  • New tree-based datasets have superseded the triviomes[2]
  • Phylogenetics maybe isn’t so bad (but I’m still not sure)
  • Once again, the cluster and parallelism in general has proven to be the bane of my fucking life
  • It can be quite difficult to present results in a sensible and meaningful fashion
  • There are so many confounding factors in analysis and I feel obligated to control for them all because it feels like bad science otherwise
  • I’m fucking losing it lately
  • Playing spaceships in space is great but don’t expect to not be blown out of fucking orbit just because you are trying to have a nice time
  • I really love ggplot2, even if the rest of R is garbage
  • I’ve been testing Gretel at “silly” levels of variation thinking that this gives proof that we are good at really hard problems, but actually more variation seems to make the problem of recovery easier
  • 1.5% per-haplotype per-base mutation seems to be my current magic number (n=2, because fuck you)
  • I wrote a shell because keeping track of all of this has been an unmitigated clusterfuck
  • I now have some plots that make me feel less like I want to jump off something tall
  • I only seem to enjoy video games that have plenty of micromanagement that stress me out more than my PhD
  • I think Bioinformatics PhD Simulator 2018 would make a great game
  • Unrealistic testing cannot give realistic answers
  • My supervisor, Chris, is a massive dendrophile[3]
  • HR bullshit makes a grumpy PhD student much more grumpy
  • This stuff, is really fucking hard

  1. supplant HAH GET IT 
  2. superseeded HAHAH I AM ON FIRE 
  3. phylogenphile? 
Status Report: May 2016 (Metahaplomes: The graph that isn't)
https://samnicholls.net/2016/06/12/status-may16/
Sun, 12 Jun 2016 20:03:24 +0000

It would seem that a sufficient amount of time has passed since my previous report to discuss how everything has broken in the meantime. You would have left off with a version of me who had not long solidified the concept of the metahaplome: a graph-inspired representation of the variation observed across aligned reads from a sequenced metagenome. Where am I now?

Metahaplomes

The graph that isn’t

At the end of my first year, I returned from my reconnaissance mission to extract data mining knowledge from a leading Belgian university with a prototype for a data structure fit to house a metahaplome: a probabilistically weighted graph that can be traversed to extract likely sequences of variants on some gene of interest. I say graph, because the structure and API do not look or particularly act like a traditional graph at all. Indeed, the current representation is a four-dimensional matrix that stores the number of observations of a symbol (SNP) A at position i co-occurring with a symbol B at position j.

This has proved problematic, as I've had difficulty explaining the significance of this to people who dare to ask what my project is about. "What do you mean it's not a graph? There's a picture of a graph on your poster right there!?" Yes, the matrix can be exploited to build a simple graph representation, but not without some information loss. As a valid gene must select a variant at each site, one cannot draw a graph that contains edges between sites of polymorphisms that are not adjacent (as a path that traverses such an edge would skip a variant site[1]). We therefore lose the ability to encode any information regarding co-occurrence of non-adjacent variants (abs(i - j) != 1) if we depict the problem with a simple graph alone.

To circumvent this, edges are not weighted upfront. Instead, to take advantage of the evidence available, the graph is dynamically weighted during traversal (the movement to the next node is variable, and depends on the nodes that have been visited already) using the elements stored in the matrix.

Thus we have a data structure capable of being utilised like a graph, with some caveats: it is not possible to enumerate all possibilities or assign weights to all edges upfront before traversal (or, for that matter, to a random edge), and a fog of war exists during any traversal (i.e. it is not possible to predict where a path may end without exploring). Essentially, we have no idea what the graph looks like until we explore it. Despite this, my solution fuses the advantage of a graph's simple representation with the advantage of an adjacency matrix that permits storage of all pertinent information. Finally, I've been able to describe the structure and algorithm verbally and mathematically.
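As a rough sketch of what I mean (this is not Hansel's actual implementation, and the names are invented): the backing store is nothing more than counts of pairwise observations, and an "edge weight" only materialises when you ask for it in the context of a partially built path, drawing on every variant chosen so far rather than just the previous one.

    from collections import defaultdict

    class PairwiseObservations:
        """Counts of symbol A at SNP position i co-occurring with symbol B at
        SNP position j on the same read: conceptually a 4D matrix [A][i][B][j]."""

        def __init__(self):
            self.counts = defaultdict(int)

        def observe_read(self, read_variants):
            """read_variants: {snp_position: symbol} for one aligned read."""
            positions = sorted(read_variants)
            for a_idx, i in enumerate(positions):
                for j in positions[a_idx + 1:]:
                    self.counts[(read_variants[i], i, read_variants[j], j)] += 1

        def get(self, a, i, b, j):
            return self.counts[(a, i, b, j)]

    def candidate_weights(obs, path, next_pos, alphabet="ACGT"):
        """Dynamically weight each candidate symbol at next_pos using the evidence
        pairing it with *every* symbol already chosen on the path (a simple graph
        could only ever consult the adjacent pair)."""
        return {b: sum(obs.get(a, i, b, next_pos) for i, a in path.items())
                for b in alphabet}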

Reweighting

Of course, having this traversable structure that takes all the evidence seen across the reads into account is great, but we need a reliable method for rescuing more than just one possible gene variant from the target metahaplome. My initial attempts at this involved invoking stochastic jitter during traversal to quite poor effect. It was not until some time after I’d got back from putting mayonnaise on everything that I considered altering the observation matrix that backs the graph itself to achieve this.

My previous report described the current methodology: given a complete path, check the marginal probability for each variant at each position of the path (i.e. the probability that one would select the same nucleotide if one looked at that variant site in isolation) and determine the smallest marginal. Then iterate over the path, down-weighting the element of the observation matrix that stores the number of co-occurrences of the i'th and the (i+1)'th selected nucleotides, by multiplying the existing value by the lowest marginal (which will be greater than 0, but smaller than 1) and subtracting that value from the current count.
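In code, the idea looks something like the sketch below (again a simplification with invented names, continuing the pairwise-count sketch above). Note that as written it only touches adjacent pairs on the path, which is exactly the oversight discussed next.

    def reweight_path(obs, path, marginals):
        """After a complete path (candidate haplotype) is recovered, dampen the
        evidence that produced it so later traversals can explore elsewhere.
        path: {snp_position: chosen_symbol}; marginals: {snp_position: probability}."""
        ratio = min(marginals[pos] for pos in path)       # smallest marginal, 0 < ratio < 1
        positions = sorted(path)
        for idx in range(len(positions) - 1):             # adjacent pairs only (see below)
            i, j = positions[idx], positions[idx + 1]
            key = (path[i], i, path[j], j)
            obs.counts[key] -= obs.counts[key] * ratio    # count *= (1 - lowest marginal)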

Initial testing yielded more accurate results with this method than anything I had tried previously, where accuracy is quantified by this not happening:

The algorithm is evaluated with a data set of several known genes from which a metagenome is simulated. The coloured lines on the chart above refer to each known input gene. The Y-axis represents the percentage of variants that are "recovered" from the metagenome; the X-axis is the iteration (or path) number. In this example, a questionable strategy caused poor performance (other than the 100% recovery of the blue gene), and a bug in handling elements that are reweighted below 1 allowed the algorithm to enter a periodic state.

After implementing the latest strategy, performance compared to the above increased significantly (at least on the limited data sets I have spent the time curating), but I was still not entirely satisfied. Recognising this was going to take much more time and thought, I procrastinated by writing up the technical aspects of my work in excruciating mathematical detail in preparation for my next paper. To wrap my head around my own equations, I commandeered the large whiteboards in the undergraduate computing room and primed myself with coffee and Galantis. Unfortunately, after an hour or two of excited scribbling, this happened:

I encountered an oversight. Bluntly:

Despite waxing lyrical about the importance of evidence arising from non-adjacent variant sites, I'd overlooked them in the reweighting process. Although I was frustrated with my own incompetence, this issue was uncovered at a somewhat opportune time, as I was looking for a likely explanation for what felt like an upper bound on the performance of the algorithm. As evidence (observation counts) for adjacent pairwise variants was decreased through reweighting, non-adjacent evidence was becoming an increasingly important factor in the decision making process for path traversal, simply by virtue of those counts being larger (as they were left untouched). Thus paths were still being heavily coerced along particular routes and were not afforded the opportunity to explore more of the graph, yielding less accurate results (fewer recovered variants) for more divergent input genes.

As usual with these critical oversights, the fix was trivial (just ensure the same reweighting rules applied to adjacent pairs of variants are applied to the non-adjacent ones too), and indeed, performance was bumped by around 5 percentage points. Hooray.

Evaluation

Generating test data (is still a pain in the arse)

So here we are: I'm still somewhat stuck in a data rut. Generating data sets (that can be verified) is a somewhat convoluted procedure. Whilst to run the algorithm all one needs is a BAM of aligned reads and an associated VCF of called SNP sites, to empirically test the output we also need to know what the output genes should look like. Currently this requires a "master" FASTA (the origin gene), a FASTA of similar genes (the ones we actually want to recover) and a blast hit table that documents how those similar genes align to the master. The workflow for generating and testing a data set looks like this:

  • Select an interesting, arbitrary master gene from a database (master.fa)
  • blast for similar genes and select several hits with decreasing identity
  • Download FASTA (genes.fa) and associated blast hit table (hits.txt) for selected genes
  • Simulate reads by shredding genes.fa (reads.fq); see the sketch after this list
  • Align reads (reads.fq) with bowtie to pseudo-reference (master.fa) to create (hoot.bam)
  • Call for SNPs on (hoot.bam) to create a VCF (hoot.vcf)
  • Construct metahaplome and traverse paths with reads (hoot.bam) and SNPs (hoot.vcf)
  • Output potential genes (out.fa)
  • Evaluate each result in out.fa against each hit in hits.txt
    • Extract DNA between subject start and end for record from genes.fa
    • Determine segment of output (from out.fa) overlapping current hit (from genes.fa)
    • Convert co-ordinates of the SNP to the current hit (genes.fa)
    • Confirm consistency between the SNP on the output and the corresponding base of the input gene
    • Return matrix of consistency for each output gene, to each input gene
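The shredding step in that workflow is simple enough to sketch. This is not the actual script (file names, read length and stride are placeholders): each input gene is just cut into fixed-length, overlapping reads at a uniform coverage.

    def read_fasta(path):
        """Minimal FASTA reader: yields (name, sequence) tuples."""
        name, seq = None, []
        with open(path) as handle:
            for line in handle:
                line = line.strip()
                if line.startswith(">"):
                    if name is not None:
                        yield name, "".join(seq)
                    name, seq = line[1:].split()[0], []
                elif line:
                    seq.append(line)
        if name is not None:
            yield name, "".join(seq)

    def shred_sequence(name, seq, read_length=150, stride=50):
        """Cut one gene into overlapping fixed-length 'reads' as FASTQ records
        (dummy qualities), giving roughly read_length/stride-fold coverage."""
        for start in range(0, len(seq) - read_length + 1, stride):
            read = seq[start:start + read_length]
            yield "@%s_%d\n%s\n+\n%s\n" % (name, start, read, "I" * len(read))

    with open("reads.fq", "w") as fq:
        for gene_name, gene_seq in read_fasta("genes.fa"):
            for record in shred_sequence(gene_name, gene_seq):
                fq.write(record)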

Discordant alignments between discordant aligners

The issue at hand primarily arises from discordant alignment decisions between the two alignment processes that make up components of the pipeline: blast and bowtie. Although blast is used to select the initial genes (given some "master"), and its resulting hit table is also used to evaluate the approach at the end of the algorithm, bowtie is used to align the reads (reads.fq) to that same master. Occasional disagreements between the two algorithms are inevitable on real data, but I assumed that given the simplicity of the data (simulated reads of uniform length, no errors, reasonable identity) they would behave the same. It may sound like an obvious problem source, but when several genes are reported as correctly extracted with high accuracy (identity) and one or two are not, you might forgive me for thinking that the algorithm just needed tweaking, rather than there being an underlying problem stemming from the alignment steps! This led to much more tail chasing than I would care to admit to.

For one example: I investigated a poorly reconstructed gene by visualising the input BAM produced by bowtie with Tablet (a rather nifty BAM+VCF+FA viewer). It turned out that for input reads belonging to one of the input genes, bowtie had called an indel[1], causing a disagreement as to what the empirical base at each SNP following that indel should have been. That is, although all of the reads from that particular input gene were aligned by bowtie as having an indel (and thus shifting bases in those reads), and processed by my algorithm with that indel taken into account, at the point of evaluation the blast hit table is the gold standard; what may have been the correct variant (indel notwithstanding) would be determined as incorrect by the alignment of the hit table.

I suppose the solution might be to switch to one aligner, but I’m aware that even the same aligner can make different decisions under differing conditions (read length).
It's important to note that currently the hit table is also used to define where to begin shredding reads for the simulated metagenome, which in turn causes trouble if bowtie disagrees with where an alignment begins and ends. I've had cases where blast aligns the first base of the subject (input gene) to the first base of the query (master), but on inspection with Tablet it becomes evident that bowtie clips the first few bases when aligning opening reads to the master. This problem is a little more subtle, and in current practice causes little trouble. Although the effect would be a reduction in observed evidence for variants at SNPs that happen to occur within the first few bases of the gene, my test sets so far do not have a SNP site so close to the start of the master. This is obviously something to watch out for, though.

At this time I’ve just been manually altering the hit table to reconcile differences between the two aligners, which is gross.

Bugs in the machine

Of course, a status report from me is not complete without some paragraph where I hold my hands up in the air and say everything was broken due to my own incompetence. Indeed, a pair of off-by-one bugs in my evaluation algorithm also warped reported results. The first, a regression introduced after altering the parameters under which the evaluation function determines an overlap between the current output gene and current hit, led to a miscalculation when translating the base position on the master to the base position on the expected input under infrequent circumstances, causing the incorrect base to be compared to the expected output. This was also accidentally fixed when I refactored the code and saw a very small increase in performance.

The second, an off-by-one error in the reporting of a new metric, "deviations from reference", caused the results to suddenly appear rather unimpressive. The metric measures the number of bases that differ from the pseudo-reference (master.fa) and that were correctly recovered by my algorithm to match an original gene from genes.fa. Running my algorithm now yielded a results table describing impressive gene recovery scores (>89%), but those genes appeared to differ from the reference by merely a few SNPs (<10). How could we suck at recovering sequences that barely deviate from the master? Why does it take so many iterations? After getting off the floor and picking up the shattered pieces of my ego and PhD, I checked the VCF and confirmed there were over a hundred SNPs across all the genes. Curious, I inspected the genes manually with Tablet to see how they compared to the reference. Indeed, there were definitely more than the four reported for one particular case, so what was going on?

To finish quickly: path iteration numbers start from 0, but are reported to the user as iter + 1, because the 0'th iteration is not catchy. My mistake was using the iter + 1 to also access the number of deviations from the reference detected in the current iteration – in a zero-indexed structure. I was fetching the number of deviations successfully extracted by the path after this one, which we would expect to be poor, as the structure would have been reweighted to prevent that path from appearing again. Nice work, me. This fix made things a little more interesting:

More testing of the testing is evidently necessary.

Conclusion

So where does that leave us? Performance is up, primarily because the code that I wrote to evaluate performance (and reweight) is now less broken. Generating data sets is still a pain in the arse, but I have got the hang of the manual process involved, so I can at least stop hiding from the work to be done. It might be worth investigating consolidating all of my alignment activities into one aligner to improve my credit score. Results are looking promising: this algorithm is now capable of extracting genes (almost or entirely whole) from simulated (albeit quite simple) metagenomes.

Next steps will be more testing, writing the method itself as a paper, and getting some proper biological evidence from the lab that this work can do what I tell people it can do.

In other news


tl;dr

  • I continue to be alive and bad at both blog and implementing experimental data structures
  • I fixed my program not working by fixing the thing that told me it wasn’t working
  • If your evaluator has holes in, you’ll spend weeks chasing problems that don’t exist
  • Never assume how someone else’s software will work, especially if you are assuming it will work like a different piece of software that you are already making assumptions about
  • Always be testing (especially testing the testing)
  • This thing actually fucking works
  • An unpleasant side effect of doing a PhD is the rate of observed existential crises increases
  • Life continues to be a series of off-by-one errors, punctuated with occasional trips to the seaside

  1. Let’s not even talk about indels for now. 
Status Report: February 2016
https://samnicholls.net/2016/03/01/status-feb16/
Tue, 01 Mar 2016 23:41:33 +0000

I have a meeting with Amanda tomorrow morning about my Next Paper™, so I thought it might be apt to gather some thoughts and report on the various states of disarray that the different facets of my PhD are currently in. Although I've briefly outlined the goal of my PhD in a previous status report, as of yet I've avoided exploring much of the detail here; partly as the work is unpublished, but primarily due to my laziness when it comes to blogging[1].

The Metahaplome

At the end of January I gave a talk at the Aberystwyth Bioinformatics Workshop[2]. The talk briskly sums up the work done so far over the first year-and-a-bit of my PhD and introduces the metahaplome: our very own new -ome, a graph-inspired representation of the variation in single nucleotide polymorphisms observed across aligned reads from a sequenced metagenome. The idea is to isolate and store information on only the genomic positions that actually vary across sequenced reads and, more importantly, to keep track of the observed evidence for these variations to co-occur together. This evidence can be exploited to reconstruct sequences of variants that are likely to actually exist in nature, as opposed to the crude approximations provided by the assembly-algorithm-du-jour.

I spent the summer of last year basking in the expertise of the data mining group at KU Leuven; learning to drive on the wrong side of the road, enjoying freshly produced breads and chocolate, incorrectly arguing that ketchup should be applied to fries instead of mayonnaise and otherwise pretending to be Belgian. I took with me two different potential representations for the metahaplome and hoped to come back with an efficient Dutch solution that would solve the problem quickly and accurately. Instead, amongst the several kilograms of chocolate, I returned with the crushing realisation that the problem I was dealing with was certainly NP-hard (i.e. very hard) and that my suggestion of abusing probability was likely the best candidate for generating solutions.

The trip wasn't a loss however: my best friend and I explored some Belgian cities and the coast of the Netherlands, and accidentally crossed the invisible Belgian border into French-speaking Wallonia, much to our confusion. I discovered mayonnaise wasn't all that bad, attended a public thesis defence and had the honour of an invite to celebrate the award of a PhD by getting drunk in a castle. I discarded several implementations of the data structures used to house the metahaplome and began work on a program that could parse sequenced reads into the latest structure. I came up with a method for calculating weights of edges in the graph, and another method for approximating those calculations after they also proved as unwieldy as the graph itself.

The metahaplome is approximations all the way down.

But the important question, will this get me a PhD does it work? Can my implementation be fed some sequenced reads that (probably) span a gene that is shared but variable between some number of different species, sampled together in a metagenome? The short answer is yes and no. The long answer is I’m not entirely sure yet.

Trouble In Silico

I'm at an empirical impasse: the algorithm performs very well or very poorly, and occasionally traps itself in a bizarre periodic state, depending on the nature of the input data. Currently the as-yet unnamed[3] metahaplome algorithm is being evaluated against several data sets which can be binned into one of three categories:

  • Triviomes: Simulated-Simulated Data
    Generates a short gene with N single nucleotide polymorphisms. The gene has M different known variants (sets of N SNPs), with each variant m_i expressed in a simulated sample at some proportion. A script generates a SAM and VCF for the reads and SNP positions respectively. The metahaplome is constructed and traversed, and the algorithm is evaluated by its ability to recover the M known variants.

  • Simulated-Real Data
    A gene is pulled from a database and submitted to BLAST. Sequences of similar but not exact identity are identified; the extracted hits are aligned to the original gene and variants are called loosely with samtools. Each gene is then fragmented into k-mers that act as artificial reads for the construction of the metahaplome. In a similar fashion to before, the metahaplome is traversed and the algorithm is evaluated by its ability to recover the genes extracted from the BLAST search. Although this uses real data, the method is still rather naive in itself, and further analysis would be needed to evaluate the algorithm's stability when encountering:

    • Indels
    • Noise and error
    • Poor coverage
    • Very skewed proportions of m_i
  • Real Data for Real
    Variant calling is completed on real reads that align to a region of a metagenomic assembly that looks "interesting". A metahaplome is constructed and traversed. The resulting paths typically match hypothetical or uncharacterised proteins with some identity. This is exciting, and impossible to evaluate empirically, which is nice because nobody can prove that the results are incorrect yet.

In general the algorithm performs well on triviomes, which is good news considering their simplicity. However, mixed results are obtained from simulated-real data, and I don't have enough evidence yet as to why this is the case. The real issue here stems from the difficulty of generating test data in an acceptable form for my own software. Reads must be aligned and SNPs called beforehand, but the software for assembly, variant calling and short read alignment is external to my own work and can produce results that I might not consider optimal for evaluation. In particular, when generating triviomes, I had difficulties getting a short read aligner to make the read alignments that I would expect to see; for this reason, at this time the triviome script generates its own SAM.

Problems at both ends

My trouble isn't limited to the construction of the metahaplome either. Whilst the majority of initial paths recovered by my algorithm are on target for a gene that we know exists in the sample, we want to go on to recover the second, third, …, i'th best paths from the graph. To do this, the edges in the graph must be re-weighted. My preliminary work shows there is quite a knife-edge here: aggressive re-weighting causes the algorithm to fail to return similar paths (even if they really do exist in the evidence), but modest re-weighting causes the algorithm to converge on new paths slowly (or not at all).

The situation is further complicated by coverage. An "important" edge in the graph (i.e. one expected to be included in many of the actual genes) may have very little evidence, and aggressive re-weighting doesn't afford the algorithm the opportunity to explore such branches before they are effectively pruned away. Any form of re-weighting must consider that some edges are covered more than others, but it is unknown to us whether that is due to over-representation in the sample or whether that edge really should appear as part of many paths.

My current strategy is triggered when a path has been recovered. For each edge in the newly extracted path (where an edge represents one SNP followed by another), the marginal distribution of the selected transition is inspected. Every selected edge is then reduced in proportion to the value of the lowest marginal: i.e. the least likely transition observed on the new path. Thus far this seems to strike a nice balance, but testing has been rather limited.

What now?

  • Simplify generation of evaluation data sets; currently this bottleneck is a pain in the ass and is holding up progress.
  • Standardise testing and keep track of results as part of a test suite instead of ad-hoc tweak-and-test.
  • Use multiple new curated simulated-real data sets to explore and optimise the algorithm’s behaviour.
  • Jury still out on edge re-weighting methodology.

In Other News

  • Publishing of first paper imminent!
  • Co-authored a research grant to acquire funding to test the results of the metahaplome recovery algorithm in a lab.
  • My PR to deprecate legacy samtools sort syntax was accepted for the 1.3 release and I got thanked on the twitters :’)
  • A couple of samtools odd-jobs, including a port of bamcheckR to samtools stats in the works…
  • sunblock still saving me hours of head-banging-on-desk time but not tidy enough to tell you about yet…
  • I’ll be attending the Microbiology Society Annual Conference in March. Say hello!

tl;dr

  • I’m still alive.
  • This stuff is quite hard which probably means it will be worth a PhD in the long run.
  • I am still bad at blog.

  1. Sorry not sorry. 
  2. The video is just short of twelve minutes, but YouTube’s analytics tell me the average viewer gives up after 5 minutes 56 seconds. Which is less than ten seconds after I mention the next segment of the talk will contain statistics. Boo. 
  3. That is, I haven’t come up with a catchy, concise and witty acronym for it yet. 
Meet the Metahaplome
https://samnicholls.net/2016/01/21/meet-the-metahaplome/
Thu, 21 Jan 2016 20:59:48 +0000

Yesterday, I gave a talk at the Aberystwyth Bioinformatics Workshop on the metahaplome: a graph-inspired structure for encoding the variation of single nucleotide polymorphisms (SNPs) observed across aligned sequenced reads. The talk is unintentionally a lightning talk, after I realised I had more slides than time and that I was all that stood between delegates and the pub, but it seemed to provide a good introduction to some of my work so far:

As a semi-interesting aside, I activated the workout mode on my Fitbit shortly before heading up to the podium to deliver my talk. My heart rate reached a peak of 162bpm and maintained an average of 126bpm. I was called to the stage ~5 minutes into the "workout", where one can observe a rise and peak in heart rate, before a slow and gentle decrease as I became more comfortable during the talk and questions:

Fitbit Workout Graph during ABW2016 Talk

Status Report: October 2015
https://samnicholls.net/2015/11/01/status-oct15/
Sun, 01 Nov 2015 19:30:25 +0000

As is customary with any blog that I attempt to keep, I've somewhat fallen behind in providing timely updates and am instead hoarding drafts in various states of readiness. This was not helped by my arguably ill-thought-out move to install WordPress and the rather painful migration that followed as a result. Now that the dust has mostly settled, I figured it might be nice to outline what I am actually working on before inevitably publishing a new epic tale of computational disaster.

The bulk of my work falls under two main projects that should hopefully sound familiar to those who follow the blog:

Metagenomes

I've now entered the second year of my PhD at Aberystwyth University, following my recent fries-and-waffle-fueled research adventure in Belgium. As a brief introduction for the uninitiated, I work in metagenomics: the study of all genetic sequences found in an environment. In particular, I'm interested in the metagenomes of microbial populations that have adapted to produce "interesting" enzymes (catalysts for chemical reactions). A few weeks ago, I presented a poster on the "metahaplome"[1], which is the culmination of my first year of work to define and formalize how variation in the sequences that produce these enzymes can be collected and organized.

DNA Quality Control

Over the summer, I returned to the Wellcome Trust Sanger Institute to continue some work I started as part of my undergraduate thesis. I've introduced the task previously and so will spare you the long-winded description, but the project initially stalled due to the significant time and effort required to prepare part of the data set. During my brief re-visit, I picked up where I left off with the aim of completing the data set. You may have read that I encountered several problems along the way, and even when this mammoth task finally appeared complete, it was not. Shortly after arriving in Leuven, the final execution of the sample improvement pipeline was done. We're ready to move forward with the analysis.

 

Side Projects

As is inevitable when you give a PhD to somebody with a short attention span, I have begun to accumulate some side projects:

SAMTools

The Sequence Alignment and Mapping Tools[2] suite is a hugely popular open source bioinformatics toolkit for interacting with sequencing data. During my undergraduate thesis I contributed a naive header parser to a project fork that improved the speed of merging large numbers of sequence files by several orders of magnitude. Recently, amongst a few small fixes here and there, I've added functionality to produce samtools stats output split by tags (such as @RG lines) and submitted a proposal to deprecate legacy samtools sort usage. With some time over the upcoming holidays, I hope to finally contribute a proper header parser in time for samtools 1.4.

goldilocks

You may remember that I'd authored a Python package called goldilocks (YouTube: Goldilocks: Locating genomic regions that are "just right", 1st RSG UK Symposium, Oct 2014) as part of my undergraduate work, to find a "just right" 1Mbp region of the human genome that was "representative" in terms of the variation expressed. Following some tidying and much optimisation, it's now a proper, documented package, and I'm now waiting to hear feedback on the submission of my first paper.

sunblock

You may have noticed my opinion on Sun Grid Engine, and the trouble I have had in using it at scale. To combat this, I’ve been working on a small side project called sunblock: a Python command line tool that encapsulates the submission and management of cluster jobs via a more user-friendly interface. The idea is to save anybody else from ever having to use Sun Grid Engine ever again. Thanks to a night in Belgium where it was far too warm to sleep, and a little Django magic, sunblock acquired a super-user-friendly interface and database backend.

Blog

This pain in the arse blog.


tl;dr

  • I’m still alive
  • I’m still working
  • Blogs are hard work

  1. Yes, sorry, it’s another -ome. I’m hoping it won’t find its way on to Jonathan Eisen’s list of #badomes
  2. Not to be confused with a series of tools invented by me, sadly. 