I have a meeting with Amanda tomorrow morning about my Next Paper™, so I thought it might be apt to gather some thoughts and report on the various states of disarray that the different facets of my PhD are currently in. Although I’ve briefly outlined the goal of my PhD in a previous status report, as of yet I’ve avoided exploring much of the detail here, partly because the work is unpublished, but primarily due to my laziness when it comes to blogging¹.

The Metahaplome

At the end of January I gave a talk at the Aberystwyth Bioinformatics Workshop². The talk briskly sums up the work done so far over the first year-and-a-bit of my PhD and introduces the metahaplome: our very own new -ome, a graph-inspired representation of the variation in single nucleotide polymorphisms observed across aligned reads from a sequenced metagenome. The idea is to isolate and store information on only the genomic positions that actually vary across sequenced reads and, more importantly, keep track of the observed evidence for these variations co-occurring. This evidence can be exploited to reconstruct sequences of variants that are likely to actually exist in nature, as opposed to the crude approximations provided by assembly-algorithm-du-jour.
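To make the "keep track of co-occurrence evidence" idea a little more concrete, here is a minimal sketch in Python. It is not the actual metahaplome implementation: the function name `count_cooccurrence`, the `(start, sequence)` read representation and the choice to count only adjacent pairs of SNP sites are all assumptions made for illustration, and reads are assumed to be gapless alignments.

```python
from collections import defaultdict


def count_cooccurrence(reads, snp_positions):
    """Count how often pairs of alleles are seen together on one read.

    reads: iterable of (start, sequence) tuples (0-based, gapless; assumed).
    snp_positions: sorted list of 0-based reference positions that vary.
    Returns {(pos_a, allele_a, pos_b, allele_b): count} for adjacent SNP sites.
    """
    evidence = defaultdict(int)
    for start, seq in reads:
        end = start + len(seq)
        # Keep only the SNP sites that this read actually covers.
        covered = [p for p in snp_positions if start <= p < end]
        # Each consecutive pair of covered sites is one observation of
        # two alleles co-occurring on the same read.
        for a, b in zip(covered, covered[1:]):
            evidence[(a, seq[a - start], b, seq[b - start])] += 1
    return dict(evidence)


# Toy input: two SNP sites (positions 2 and 6) and four aligned reads.
reads = [(0, "ACGTAAG"), (0, "ACGTAAG"), (0, "ACCTAAT"), (2, "CTAAT")]
snps = [2, 6]
evidence = count_cooccurrence(reads, snps)
```

In this toy input, G at position 2 is only ever seen with G at position 6, and C only with T, which is the sort of evidence that later lets a traversal prefer variants that were actually observed together.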

I spent the summer of last year basking in the expertise of the data mining group at KU Leuven; learning to drive on the wrong side of the road, enjoying freshly produced breads and chocolate, incorrectly arguing that ketchup should be applied to fries instead of mayonnaise and otherwise pretending to be Belgian. I took with me two different potential representations for the metahaplome and hoped to come back with an efficient Dutch solution that would solve the problem quickly and accurately. Instead, amongst the several kilograms of chocolate, I returned with the crushing realisation that the problem I was dealing with was certainly NP-hard (i.e. very hard) and that my suggestion of abusing probability was likely the best candidate for generating solutions.

The trip wasn’t a loss, however: my best friend and I explored some Belgian cities, the coast of the Netherlands and accidentally crossed the invisible Belgian border into French-speaking Wallonia, much to our confusion. I discovered mayonnaise wasn’t all that bad, attended a public thesis defence and had the honour of an invite to celebrate the award of a PhD by getting drunk in a castle. I discarded several implementations of the data structures used to house the metahaplome and began work on a program that could parse sequenced reads into the latest structure. I came up with a method for calculating weights of edges in the graph, and another method for approximating those calculations after they also proved as unwieldy as the graph itself.

The metahaplome is approximations all the way down.

But the important question, ~~will this get me a PhD~~ does it work? Can my implementation be fed some sequenced reads that (probably) span a gene that is shared but variable between some number of different species, sampled together in a metagenome? The short answer is yes and no. The long answer is I’m not entirely sure yet.

Trouble In Silico

I’m at an empirical impasse: depending on the nature of the input data, the algorithm performs very well or very poorly, and occasionally traps itself in a bizarre periodic state. Currently the as-of-yet unnamed³ metahaplome algorithm is being evaluated against several data sets which can be binned in one of three categories:

Triviomes: Simulated-Simulated Data
A script generates a short gene with N single nucleotide polymorphisms. The gene has M different known variants (sets of N SNPs), with each m_i expressed in a simulated sample with some proportion. The script writes a SAM and a VCF for the reads and SNP positions respectively. The metahaplome is constructed and traversed and the algorithm is evaluated by its ability to recover the M known variants.
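For the curious, the triviome recipe can be sketched roughly as follows. This is a simplified stand-in and not the real script: it returns `(start, sequence)` reads rather than writing SAM and VCF files, and the function name `make_triviome` and all of its parameters are hypothetical.

```python
import random


def make_triviome(gene_len, n_snps, m_variants, proportions,
                  n_reads, read_len, seed=0):
    """Simulate M variants of a gene with N SNP sites, then sample reads."""
    if m_variants > 4 ** n_snps:
        raise ValueError("not enough allele combinations for that many variants")
    rng = random.Random(seed)
    bases = "ACGT"
    reference = [rng.choice(bases) for _ in range(gene_len)]
    snp_positions = sorted(rng.sample(range(gene_len), n_snps))

    # Draw distinct allele combinations so each variant differs from the rest.
    seen, variants = set(), []
    while len(variants) < m_variants:
        alleles = tuple(rng.choice(bases) for _ in snp_positions)
        if alleles in seen:
            continue
        seen.add(alleles)
        seq = reference[:]
        for pos, allele in zip(snp_positions, alleles):
            seq[pos] = allele
        variants.append("".join(seq))

    # Each read is copied from one variant, chosen according to its proportion.
    reads = []
    for _ in range(n_reads):
        source = rng.choices(variants, weights=proportions, k=1)[0]
        start = rng.randrange(gene_len - read_len + 1)
        reads.append((start, source[start:start + read_len]))
    return snp_positions, variants, reads


snp_positions, variants, reads = make_triviome(
    gene_len=60, n_snps=4, m_variants=3,
    proportions=[0.6, 0.3, 0.1], n_reads=50, read_len=20, seed=1)
```

Note that in this sketch a nominal SNP site may end up with the same allele in every variant; a real generator would probably want to guard against that.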
Simulated-Real Data
A gene is pulled from a database and submitted to BLAST. Sequences of similar but not exact identity are identified and extracted. The extracted hits are aligned to the original gene and variants are called loosely with samtools. Each gene is then fragmented into k-mers that act as artificial reads for the construction of the metahaplome. In a similar fashion to before, the metahaplome is traversed and the algorithm is evaluated by its ability to recover the genes extracted from the BLAST search. Although it uses real data, this method is still rather naive and further analysis would be needed to evaluate the algorithm’s stability when encountering:
- Indels
- Noise and error
- Poor coverage
- Very skewed proportions of m_i
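The k-mer fragmentation step used for the simulated-real data is simple enough to sketch. The helper below is illustrative only: the name `fragment_into_kmers` and the `step` parameter are assumptions, and each artificial read is returned together with its offset so that it can be treated as already aligned.

```python
def fragment_into_kmers(sequence, k, step=1):
    """Slice a sequence into overlapping k-mers that act as artificial reads.

    Returns a list of (offset, kmer) tuples; a sequence shorter than k
    yields an empty list.
    """
    if k <= 0 or step <= 0:
        raise ValueError("k and step must be positive")
    return [(i, sequence[i:i + k])
            for i in range(0, len(sequence) - k + 1, step)]


kmers = fragment_into_kmers("ACGTAC", k=4)
# kmers == [(0, "ACGT"), (1, "CGTA"), (2, "GTAC")]
```

Reads produced this way are error-free and evenly spread, which is exactly why the items in the list above (indels, noise, poor coverage, skewed proportions) are not exercised by this kind of data.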
Real Data for Real
Variant calling is completed on real reads that align to a region on a metagenomic assembly that looks “interesting”. A metahaplome is constructed and traversed. The resulting paths typically match hypothetical or uncharacterised proteins with some identity. This is exciting and impossible to evaluate empirically, which is nice because nobody can prove that the results are incorrect yet.

In general the algorithm performs well on triviomes, which is good news considering their simplicity. However, results on simulated-real data are mixed, and I don’t yet have enough evidence as to why this is the case. The real issue here stems from the difficulty in generating test data in an acceptable form for my own software. Reads must be aligned and SNPs called beforehand, but the software for assembly, variant calling and short read alignment is external to my own work and can produce results that I might not consider optimal for evaluation. In particular, when generating triviomes, I had difficulties with getting a short read aligner to make the read alignments that I would expect to see; for this reason, at this time the triviome script generates its own SAM.

Problems at both ends

My trouble isn’t limited to the construction of the metahaplome either. Whilst the majority of initial paths recovered by my algorithm are on target to a gene that we know exists in the sample, we want to go on to recover the second, third, …, i-th best paths from the graph. To do this, the edges in the graph must be re-weighted. My preliminary work shows there is quite a knife-edge here: aggressive re-weighting causes the algorithm to fail to return similar paths (even if they do really exist in the evidence), but modest re-weighting causes the algorithm to converge on new paths slowly (or not at all).

The situation is further complicated by coverage. An “important” edge in the graph (i.e. one that is expected to be included in many of the actual genes) may have very little evidence, and aggressive re-weighting doesn’t afford the algorithm the opportunity to explore such branches before they are effectively pruned away. Any form of re-weighting must consider that some edges are covered more than others, but it is unknown to us whether that is due to over-representation in the sample or whether that edge really should appear as part of many paths.

My current strategy is triggered when a path has been recovered. For each edge in the newly extracted path (where an edge represents one SNP followed by another), the marginal distribution of the selected transition is inspected. Every selected edge is then reduced in proportion to the value of the lowest marginal: i.e. the least likely transition observed on the new path. Thus far this seems to strike a nice balance but testing has been rather limited.
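One possible reading of that re-weighting step is sketched below. It rests on several assumptions that are not spelled out above: the graph is held as a dictionary of positive edge weights, the marginal of a transition is taken to be its weight divided by the total weight leaving the same node, and "reduced in proportion to" is interpreted as multiplying each selected edge by (1 - lowest marginal). The name `reweight_path` and the node labels are hypothetical.

```python
def reweight_path(weights, path):
    """Down-weight every edge on a recovered path by its weakest marginal.

    weights: {(u, v): positive weight} for each edge in the graph (assumed).
    path: list of nodes, in order, describing the recovered path.
    Returns (updated_weights, lowest_marginal); the input is not modified.
    """
    edges = list(zip(path, path[1:]))
    if not edges:
        raise ValueError("a path needs at least two nodes")

    # Total outgoing weight per node, used to turn weights into marginals.
    out_totals = {}
    for (u, _), w in weights.items():
        out_totals[u] = out_totals.get(u, 0.0) + w

    # Assumed definition: marginal = edge weight / total weight leaving u.
    lowest = min(weights[(u, v)] / out_totals[u] for u, v in edges)

    updated = dict(weights)
    for edge in edges:
        # Assumed interpretation: shrink each selected edge by the lowest marginal.
        updated[edge] = weights[edge] * (1.0 - lowest)
    return updated, lowest


weights = {
    ("s1:A", "s2:C"): 8, ("s1:A", "s2:T"): 2,
    ("s2:C", "s3:G"): 3, ("s2:C", "s3:A"): 3,
}
updated, lowest = reweight_path(weights, ["s1:A", "s2:C", "s3:G"])
```

Here the weakest transition on the path has a marginal of 0.5, so both selected edges are halved while the edges off the path are left alone. Under this interpretation a path made entirely of unopposed edges (marginal 1.0) would be zeroed out in one step, which is one face of the knife-edge described earlier.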

What now?

Simplify generation of evaluation data sets; currently this bottleneck is a pain in the ass and is holding up progress.
Standardise testing and keep track of results as part of a test suite instead of ad-hoc tweak-and-test.
Use multiple new curated simulated-real data sets to explore and optimise the algorithm’s behaviour.
Jury still out on edge re-weighting methodology.

In Other News

Publishing of first paper imminent!
Co-authored a research grant to acquire funding to test the results of the metahaplome recovery algorithm in a lab.
My PR to deprecate legacy samtools sort syntax was accepted for the 1.3 release and I got thanked on the twitters :’)
A couple of samtools odd-jobs, including a port of bamcheckR to samtools stats in the works…
sunblock still saving me hours of head-banging-on-desk time but not tidy enough to tell you about yet…
I’ll be attending the Microbiology Society Annual Conference in March. Say hello!

tl;dr

I’m still alive.
This stuff is quite hard which probably means it will be worth a PhD in the long run.
I am still bad at blog.

¹ Sorry not sorry.
² The video is just short of twelve minutes, but YouTube’s analytics tell me the average viewer gives up after 5 minutes 56 seconds. Which is less than ten seconds after I mention the next segment of the talk will contain statistics. Boo.
³ That is, I haven’t come up with a catchy, concise and witty acronym for it yet.