catch up – Samposium

Status Report: February 2016

Sam — Tue, 01 Mar 2016 23:41:33 +0000

I have a meeting with Amanda tomorrow morning about my Next Paper, so thought it might be apt to gather some thoughts and report on the various states of disarray that the different facets of my PhD are currently in. Although I’ve briefly outlined the goal of my PhD in a previous status report, as of yet I’ve avoided exploring much of the detail here; partly as the work is unpublished, but primarily due to my laze when it comes to blogging¹.

The Metahaplome

At the end of January I gave a talk at the Aberystwyth Bioinformatics Workshop². The talk briskly sums up the work done so far over the first year-and-a-bit of my PhD and introduces the metahaplome: our very own new -ome, a graph-inspired representation of the variation in single nucleotide polymorphisms observed across aligned reads from a sequenced metagenome. The idea is to isolate and store information on only the genomic positions that actually vary across sequenced reads and, more importantly, keep track of the observed evidence for these variations to co-occur. This evidence can be exploited to reconstruct sequences of variants that are likely to actually exist in nature, as opposed to the crude approximations provided by assembly-algorithm-du-jour.
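To make the idea of "evidence for co-occurrence" a little more concrete, here is a minimal sketch in Python. It is not the actual metahaplome code: the function name, the simplified `(start, sequence)` read format (no indels, no clipping) and the choice to count every pair of SNPs seen on the same read are all assumptions made for illustration.

```python
from collections import defaultdict
from itertools import combinations


def build_cooccurrence(reads, snp_positions):
    """Count how often pairs of SNP alleles are observed on the same read.

    reads: iterable of (start, sequence) tuples, where start is the 0-based
           reference position of the first base (hypothetical simplified
           alignment: no indels or clipping).
    snp_positions: iterable of 0-based reference positions known to vary.
    """
    snps = sorted(snp_positions)
    counts = defaultdict(int)
    for start, seq in reads:
        end = start + len(seq)
        # Keep only the variable positions that this read actually covers.
        observed = [(p, seq[p - start]) for p in snps if start <= p < end]
        # Every pair of (position, allele) on one read is one piece of
        # evidence that those two alleles occur together.
        for a, b in combinations(observed, 2):
            counts[(a, b)] += 1
    return dict(counts)


reads = [(0, "ACGT"), (0, "ATGA"), (2, "GTCC"), (0, "ACGT")]
evidence = build_cooccurrence(reads, snp_positions=[1, 3])
```

Here the third read only covers one of the two SNP positions, so it contributes no pairwise evidence. Everything that is not a SNP position is simply thrown away, which is the point.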

I spent the summer of last year basking in the expertise of the data mining group at KU Leuven; learning to drive on the wrong side of the road, enjoying freshly produced breads and chocolate, incorrectly arguing that ketchup should be applied to fries instead of mayonnaise and otherwise pretending to be Belgian. I took with me two different potential representations for the metahaplome and hoped to come back with an efficient Dutch solution that would solve the problem quickly and accurately. Instead, amongst the several kilograms of chocolate, I returned with the crushing realisation that the problem I was dealing with was certainly NP-hard (i.e. very hard) and that my suggestion of abusing probability was likely the best candidate for generating solutions.

The trip wasn’t a loss, however: my best friend and I explored some Belgian cities and the coast of the Netherlands, and accidentally crossed the invisible Belgian border into French-speaking Wallonia, much to our confusion. I discovered mayonnaise wasn’t all that bad, attended a public thesis defence and had the honour of an invite to celebrate the award of a PhD by getting drunk in a castle. I discarded several implementations of the data structures used to house the metahaplome and began work on a program that could parse sequenced reads into the latest structure. I came up with a method for calculating weights of edges in the graph, and another method for approximating those calculations after they also proved as unwieldy as the graph itself.

The metahaplome is approximations all the way down.

But the important question, ~~will this get me a PhD~~ does it work? Can my implementation be fed some sequenced reads that (probably) span a gene that is shared but variable between some number of different species, sampled together in a metagenome? The short answer is yes and no. The long answer is I’m not entirely sure yet.

Trouble In Silico

I’m at an empirical impasse: depending on the nature of the input data, the algorithm performs very well or very poorly, and occasionally traps itself in a bizarre periodic state. Currently the as-of-yet unnamed³ metahaplome algorithm is being evaluated against several data sets, which can be binned into one of three categories:

Triviomes: Simulated-Simulated Data
A script generates a short gene with N single nucleotide polymorphisms. The gene has M different known variants (sets of N SNPs), with each m_i expressed in a simulated sample in some proportion. The script also generates a SAM and VCF for the reads and SNP positions respectively. The metahaplome is constructed and traversed and the algorithm is evaluated by its ability to recover the M known variants.
Simulated-Real Data
A gene is pulled from a database and submitted to BLAST. Sequences of similar but not exact identity are identified. The extracted hits are aligned to the original gene and variants are called loosely with samtools. Each gene is then fragmented into k-mers that act as artificial reads for the construction of the metahaplome. In a similar fashion to before, the metahaplome is traversed and the algorithm is evaluated by its ability to recover the genes extracted from the BLAST search. Although it uses real data, this method is still rather naive in itself, and further analysis would be needed to evaluate the algorithm’s stability when encountering:
- Indels
- Noise and error
- Poor coverage
- Very skewed proportions of m_i
Real Data for Real
Variant calling is completed on real reads that align to a region on a metagenomic assembly that looks “interesting”. A metahaplome is constructed and traversed. The resulting paths typically match hypothetical or uncharacterised proteins with some identity. This is exciting and impossible to evaluate empirically, which is nice because nobody can prove that the results are incorrect yet.
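For the curious, the triviome idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not my actual script: it works purely in "SNP space" (each variant is a string of N alleles), it does not write a SAM or VCF, and the function name and parameters (`make_triviome`, `read_span`, `seed`) are made up for the example.

```python
import random


def make_triviome(n_snps, proportions, n_reads, read_span, seed=0):
    """Simulate M known variants over N SNP sites and sample reads from them.

    proportions: one weight per variant m_i (M = len(proportions)).
    read_span: number of consecutive SNP sites covered by each read.
    Returns (haplotypes, reads), where each read is a tuple of
    (source variant index, first SNP index, observed alleles).
    """
    rng = random.Random(seed)
    # One random allele string per known variant. Nothing here forces the
    # variants to differ at every site; a real generator probably should.
    haplotypes = [
        "".join(rng.choice("ACGT") for _ in range(n_snps))
        for _ in proportions
    ]
    reads = []
    for _ in range(n_reads):
        # Pick the source variant m_i according to its proportion.
        i = rng.choices(range(len(haplotypes)), weights=proportions)[0]
        start = rng.randrange(0, n_snps - read_span + 1)
        reads.append((i, start, haplotypes[i][start:start + read_span]))
    return haplotypes, reads


haplotypes, reads = make_triviome(
    n_snps=10, proportions=[0.7, 0.2, 0.1], n_reads=200, read_span=4
)
```

Keeping the source variant index on each read is what makes evaluation possible: the paths recovered from the graph can be checked against the M strings in `haplotypes`. Skewing `proportions` further is one cheap way to probe the "very skewed proportions of m_i" case listed above.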

In general the algorithm performs well on triviomes, which is good news considering their simplicity. Results on simulated-real data are mixed, however, and I don’t yet have enough evidence to say why. The real issue here stems from the difficulty in generating test data in an acceptable form for my own software. Reads must be aligned and SNPs called beforehand, but the software for assembly, variant calling and short read alignment is external to my own work and can produce results that I might not consider optimal for evaluation. In particular, when generating triviomes, I had difficulty getting a short read aligner to make the read alignments that I would expect to see; for this reason, the triviome script currently generates its own SAM.

Problems at both ends

My trouble isn’t limited to the construction of the metahaplome either. Whilst the majority of initial paths recovered by my algorithm are on target to a gene that we know exists in the sample, we want to go on to recover the second, third, …, i’th best paths from the graph. To do this, the edges in the graph must be re-weighted. My preliminary work shows there is quite a knife-edge here: aggressive re-weighting causes the algorithm to fail to return similar paths (even if they do really exist in the evidence), but modest re-weighting causes the algorithm to converge on new paths slowly (or not at all).

The situation is further complicated by coverage. An “important” edge in the graph (i.e. one that is expected to be included in many of the actual genes) may have very little evidence, and aggressive re-weighting doesn’t afford the algorithm the opportunity to explore such branches before they are effectively pruned away. Any form of re-weighting must consider that some edges are covered more than others, but we cannot tell whether that is due to over-representation in the sample or whether that edge really should appear as part of many paths.

My current strategy is triggered when a path has been recovered. For each edge in the newly extracted path (where an edge represents one SNP followed by another), the marginal distribution of the selected transition is inspected. Every selected edge is then reduced in proportion to the value of the lowest marginal: i.e. the least likely transition observed on the new path. Thus far this seems to strike a nice balance but testing has been rather limited.
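As a rough sketch of that re-weighting step (and only a sketch), the Python below takes one possible reading of "reduced in proportion to the lowest marginal": each edge on the path is scaled by (1 − lowest marginal). The nested-dict graph, the `reweight` name and that exact scaling rule are assumptions for illustration and may not match what my implementation ends up doing.

```python
def reweight(weights, path):
    """Down-weight the edges of a recovered path, in place.

    weights: dict mapping node -> {successor: weight}; a node stands for a
             (SNP index, allele) pair and an edge for one SNP followed by
             another. Weights are assumed to be positive.
    path: list of nodes in the order they were visited.
    Returns the lowest marginal observed along the path.
    """
    edges = list(zip(path, path[1:]))
    marginals = []
    for u, v in edges:
        # Marginal of the chosen transition among all transitions out of u.
        total = sum(weights[u].values())
        marginals.append(weights[u][v] / total)
    lowest = min(marginals)
    for u, v in edges:
        # Assumed rule: shrink each selected edge by the lowest marginal.
        # A path whose every step was certain (lowest == 1.0) is zeroed.
        weights[u][v] *= 1.0 - lowest
    return lowest


weights = {
    (0, "A"): {(1, "C"): 3.0, (1, "T"): 1.0},
    (1, "C"): {(2, "G"): 1.0, (2, "A"): 1.0},
}
lowest = reweight(weights, [(0, "A"), (1, "C"), (2, "G")])
```

In this toy graph the two chosen transitions have marginals 0.75 and 0.5, so both selected edges are halved and the untouched alternatives become relatively more attractive on the next traversal. A confident path (high lowest marginal) is penalised heavily, while a path that squeaked through on thin evidence is barely touched.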

What now?

Simplify generation of evaluation data sets: currently this bottleneck is a pain in the ass and is holding up progress.
Standardise testing and keep track of results as part of a test suite instead of ad-hoc tweak-and-test.
Use multiple new curated simulated-real data sets to explore and optimise the algorithm’s behaviour.
Jury still out on edge re-weighting methodology.

In Other News

Publishing of first paper imminent!
Co-authored a research grant to acquire funding to test the results of the metahaplome recovery algorithm in a lab.
My PR to deprecate legacy samtools sort syntax was accepted for the 1.3 release and I got thanked on the twitters :’)
A couple of samtools odd-jobs, including a port of bamcheckR to samtools stats in the works…
sunblock still saving me hours of head-banging-on-desk time but not tidy enough to tell you about yet…
I’ll be attending the Microbiology Society Annual Conference in March. Say hello!

tl;dr

I’m still alive.
This stuff is quite hard which probably means it will be worth a PhD in the long run.
I am still bad at blog.

¹ Sorry not sorry.
² The video is just short of twelve minutes, but YouTube’s analytics tell me the average viewer gives up after 5 minutes 56 seconds. Which is less than ten seconds after I mention the next segment of the talk will contain statistics. Boo.
³ That is, I haven’t come up with a catchy, concise and witty acronym for it yet.

Status Report: October 2015

Sam — Sun, 01 Nov 2015 19:30:25 +0000

As is customary with any blog that I attempt to keep, I’ve somewhat fallen behind in providing timely updates and am instead hoarding drafts in various states of readiness. This was unhelped by my arguably ill thought out move to install WordPress and the rather painful migration that followed as a result. Now that the dust has mostly settled, I figured it might be nice to outline what I am actually working on before inevitably publishing a new epic tale of computational disaster.

The bulk of my work falls under two main projects that should hopefully sound familiar to those who follow the blog:

Metagenomes

I’ve now entered the second year of my PhD at Aberystwyth University, following my recent fries-and-waffle-fueled research adventure in Belgium. As a brief introduction to the uninitiated, I work in metagenomics: the study of all genetic sequences found in an environment. In particular, I’m interested in the metagenomes of microbial populations that have adapted to produce “interesting” enzymes (catalysts for chemical reactions). A few weeks ago, I presented a poster on the “metahaplome”¹, which is the culmination of my first year of work: to define and formalize how variation in sequences that produce these enzymes can be collected and organized.

DNA Quality Control

Over the summer, I returned to the Wellcome Trust Sanger Institute to continue some work I started as part of my undergraduate thesis. I’ve introduced the task previously and so will spare you the long winded description, but the project initially stalled due to the significant time and effort required to prepare part of the data set. During my brief re-visit, I picked up where I left off with the aim to complete the data set. You may have read that I encountered several problems along the way, and even when this mammoth task finally appeared complete, it was not. Shortly after arriving in Leuven, the final execution of the sample improvement pipeline was done. We’re ready to move forward with the analysis.

Side Projects

As is inevitable when you give a PhD to somebody with a short attention span, I have begun to accumulate some side projects:

SAMTools

The Sequence Alignment and Mapping Tools² suite is a hugely popular open source bioinformatics toolkit for interacting with sequencing data. During my undergraduate thesis I contributed a naive header parser to a project fork that improved the speed of merges of large numbers of sequence files by several orders of magnitude. Recently, amongst a few small fixes here and there, I’ve added functionality to produce samtools stats output split by tags (such as @RG lines) and submitted a proposal to deprecate legacy samtools sort usage. With some time over upcoming holidays, I hope to finally contribute a proper header parser in time for samtools 1.4.

goldilocks

You may remember that I’d authored a Python package called goldilocks (YouTube: Goldilocks: Locating genomic regions that are “just right”, 1st RSG UK Symposium, Oct 2014) as part of my undergraduate work, to find a “just right” 1Mbp region of the human genome that was “representative” in terms of variation expressed. Following some tidying and much optimisation, it’s now a proper package, documented, and I’m now waiting to hear feedback on the submission of my first paper.

sunblock

You may have noticed my opinion on Sun Grid Engine, and the trouble I have had in using it at scale. To combat this, I’ve been working on a small side project called sunblock: a Python command line tool that encapsulates the submission and management of cluster jobs via a more user-friendly interface. The idea is to save anybody else from ever having to use Sun Grid Engine ever again. Thanks to a night in Belgium where it was far too warm to sleep, and a little Django magic, sunblock acquired a super-user-friendly interface and database backend.

Blog

This pain in the arse blog.

tl;dr

I’m still alive
I’m still working
Blogs are hard work

¹ Yes, sorry, it’s another -ome. I’m hoping it won’t find its way on to Jonathan Eisen’s list of #badomes.
² Not to be confused with a series of tools invented by me, sadly.