Bioinformatics – Samposium: The Exciting Adventures of Sam
https://samnicholls.net

Status Report: 2018: The light is at the end of the tunnel that I continue to build
https://samnicholls.net/2018/01/15/status-jan18-p1/
Mon, 15 Jan 2018 20:39:54 +0000

Happy New Year!
The guilt of not writing has reached a level where I feel sufficiently obligated to draft a post. You’ll likely notice from the upcoming contents that I am still a PhD student, despite a previous, more optimistic version of myself writing that 2016 would be my final Christmas as a PhD candidate.

Much has happened since my previous Status Report, and I’m sure much of it will spin off to form several posts of their own, eventually. For the sake of brevity, I’ll give a high level overview.
I’m supposed to be writing a thesis anyway.


Previously on…

We last parted ways with a double-bill status report lamenting the troubles of generating suitable test data for my metagenomic haplotype recovery algorithm, and documenting the ups-and-downs-and-ups-again of analysing one of the synthetic data sets for my pre-print. In particular, I was on a quest to respond to our reviewers’ desire for more realistic data: real reads.

Gretel: Now with real reads!

Part Two of my previous report alluded to a Part Three that I never got around to finishing, on the creation and analysis of a test data set consisting of real reads. This was a major concern of the reviewers who gave feedback on our initial pre-print. Without getting into too much detail (I’m sure there’s time for that), I found a suitable data set consisting of real sequence reads from a lab mix of five HIV strains, used to benchmark algorithms in the related problem of viral quasispecies reconstruction. After fixing a small bug, and implementing deletion handling, it turns out we do well on this difficult problem. Very well.

In the same fashion as our synthetic DHFR metahaplome, this HIV data set provided five known haplotypes, representing five different HIV-1 strains. Importantly, we were also provided with real Illumina short-reads from a sequencing run containing a mix of the five known strains. This was our holy grail, finally: a benchmark with sequence reads and a set of known haplotypes. Gretel is capable of recovering long, highly variable genes with 100% accuracy. My favourite result is a recovery of env, the ridiculously hyper-variable gene that encodes the HIV-1 envelope glycoproteins, with Gretel correctly recovering all but one of 2,568 positions. Not bad.

A new pre-print

Armed with real-reads, and improved results for our original DHFR test data (thanks to some fiddling with bowtie2), we released a new pre-print. The manuscript was a substantial improvement over its predecessor, which meant it was all the more disappointing to be rejected from five different journals. But, more on this misery at another time.

Despite our best efforts to address the previous concerns, new reviewers felt that our data sets were still not a good representation of the problem at hand: “Where is the metagenome?”. It felt like the goal-posts had moved; suddenly, real reads were not enough. It’s both a frustrating and a fair response: work should be empirically validated, but there are no metagenomic data sets with both a set of sequence reads and known haplotypes. So, it was time to make one.

I’m a real scientist now…

And so, I embarked upon what would become the most exciting and frustrating adventure of my PhD. My first experience of the lab as a computational biologist is a post still sat in draft, but suffice to say that the learning curve was steep. I’ve discovered that there are many different types of water and that they all look the same, that 1ml is a gigantic volume, that you’ll lose your fingerprints if you touch a metal drawer inside a -80°C freezer, and that, contrary to what I might have thought before, transferring tiny volumes of colourless liquids between tiny tubes without fucking up a single thing takes a lot of time, effort and skill. I have a new appreciation for the intricate and stochastic nature of lab work, and I understand what it’s like for someone to “borrow” a reagent that you spent hours of your time making from scratch. And finally, I had a legitimate reason to wear an ill-fitting lab coat that I purchased in my first year (2010) to look cool at computer science socials.

With this new-found skill-tree to work on, I felt like I was becoming a proper interdisciplinary scientist, but this comes at a cost. Context switching isn’t cheap, and I was reminded of my undergraduate days where I juggled mathematics, statistics and computing to earn my joint honours degree. I had more lectures, more assignments and more exams than my peers, but this was and still is the cost of my decision to become an interdisciplinary scientist.

And it was often difficult to find much sympathy from either side of the Venn diagram…

…and science can be awful

I’ve suffered many frustrations as a programmer. One can waste hours tracking down a bug that turns out to be a simple typo, or more likely, an off-by-one error that plagues much of bioinformatics. I’ve felt the self-directed anger of having submitted thousands of cluster jobs that have failed with a missing parameter, or waited hours for a program to complete, only to discover the disk has run out of room to store the output. Yet, these problems pale in comparison to problems at the bench.

I’ve spent days in the lab, setting-up and executing PCR, casting, loading and running gels, only to take a UV image of absolutely nothing at all.

Last year, I spent most of Christmas shepherding data through our cluster, much to my family’s dismay. This year, I had to miss a large family do for a sister’s milestone birthday. I spent many midnights in the lab, lamenting the life of a PhD student, and shuffling around with angry optimism; “Surely it has to fucking work this time?”. Until finally, I got what I wanted.

I screamed so loud with glee that security came to check on me. “I’m a fucking scientist now!”

New Nanopore Toys

My experiment was simple in principle. Computationally, I’d predicted haplotypes with my Gretel method from short-read Illumina data from a real rumen microbiome. I designed 10 pairs of primers to capture 10 genes of interest (with hydrolytic activity) using the haplotypes. And finally, after several weeks of almost 24/7 lab work, building cDNA libraries and amplifying the genes of interest, I made enough product for the exciting next step: Nanopore sequencing.

With some invaluable assistance from our resident Nanopore expert Arwyn Edwards (@arwynedwards) and PhD student André (@GeoMicroSoares), I sequenced my amplicons on an Oxford Nanopore MinION, and the results were incredible.

Our Nanopore reads strongly supported our haplotypes, and concurred with the Sanger sequencing. Finally, we have empirical biological evidence that Gretel works.

The pre-print rises

With this bombshell in the bag, the third version of my pre-print rose from the ashes of the second. We demoted the DHFR and HIV-1 data sets to the Supplement, and in their place included an analysis of our performance on a de facto benchmark mock community introduced by Chris Quince. The data sets and evaluation mechanisms that our previous reviewers found unrepresentative and convoluted were gone. I even got to include a Circos plot.

Once more, we substantially updated the manuscript, and released a new pre-print. We made our way to bioRxiv to much Twitter fanfare, earning over 1,500 views in our first week.

This work also addresses every piece of feedback we’ve had from reviewers in the past. Surely, the publishing process would now finally recognise our work and send us out for review, right?

Sadly, the journey of this work is still not smooth sailing, with three of my weekends marred by Friday desk rejections…

…and a fourth desk rejection on the last working day before Christmas was pretty painful. But we are currently grateful to be in discussion with an editor and I am trying to remain hopeful we will get where we want to be in the end. Wish us luck!


In other news…

Of course, I am one for procrastination, and have been keeping busy while all this has been unfolding…

I hosted a national student conference

I am applying for some fellowships

I’ve officially started my thesis…

…which is just as well, because the money is gone

I’ve started making cheap lab tat with my best friend…

…it’s approved by polar bears

…and the UK Centre for Astrobiology

…and has been to the Arctic

I gave an invited talk at a big conference…

…it seemed to go down well

I hosted UKIEPC at Aber for the 4th year

We’ve applied to fund Monster Lab…

…and made a website to catalogue our monsters

For a change I chose my family over my PhD and had a fucking great Christmas


What’s next?

  • Get this fucking great paper off my desk and out of my life
  • Hopefully get invited to some fellowship interviews
  • Continue making cool stuff with Sam and Tom Industrys
  • Do more cool stuff with Monster Lab
  • Finish this fucking thesis so I can finally do something else

tl;dr

  • Happy New Year
  • For more information, please re-read
Status Report: November 2016 (Part II): Revisiting the Synthetic Metahaplomes
https://samnicholls.net/2016/12/24/status-nov16-p2/
Sat, 24 Dec 2016 01:19:57 +0000

In the opening act of this status report, I described the abrupt and unfortunate end to the adventure of my pre-print. In response to reviewer feedback, I outlined three major tasks that lie ahead in wait for me, blocking the path to an enjoyable Christmas holiday, and a better manuscript submission to a more specialised, alternative journal:

  • Improve Triviomes
    We are already doing something interesting and novel, but the “triviomes” are evidently convoluting the explanation. We need something with more biological grounding such that we don’t need to spend many paragraphs explaining why we’ve made certain simplifications, or cause readers to question why we are doing things in a particular way. Note this new method will still need to give us a controlled environment to test the limitations of Hansel and Gretel.
  • Polish DHFR and AIMP1 analysis
    One of our reviewers misinterpreted some of the results, and drew a negative conclusion about Gretel’s overall accuracy. I’d like to revisit the DHFR and AIMP1 data sets both to improve the story we tell, and to describe in more detail (with more experiments) under what conditions we can and cannot recover haplotypes accurately.

  • Real Reads
    Create and analyse a data set consisting of real reads.

The first part of this report covered my vented frustrations with generating, and particularly analysing, the new “Treeviomes” in depth. This part turns focus to the sequel of my already existing DHFR analysis, addressing the reasoning behind a particularly crap set of recoveries and reassuring myself that perhaps everything is not a disaster after all.

Friends and family, you might be happy to know that I am less grumpy than last week, as a result of the work described in this post. You can skip to the tldr now.

DHFR II: Electric Boogaloo

Flashback

Our preprint builds upon the initial naive testing of Hansel and Gretel on recovery of randomly generated haplotypes, by introducing a pair of metahaplomes that each contain five haplotypes of varying similarity for a particular gene; namely DHFR and AIMP1.

Testing follows the same formula as we’ve discussed in a previous blog post:

  • Select a master gene (an arbitrary choice; DHFR, in this case)
  • megaBLAST for similar sequences and select 5 arbitrary sequences of decreasing identity
  • Extract the overlapping regions on those sequences, these are the input haplotypes
  • Generate a random set of (now properly) uniformly distributed reads, of a fixed length and coverage, from each of the input haplotypes
  • Use the master as a pseudo-reference against which to align your synthetic reads
  • Stand idly by as many of your reads are hopelessly discarded by your favourite aligner
  • Call variants on what is left of your aligned reads by assuming any heterogeneous site is a variant
  • Feed the aligned reads and called variant positions to Gretel
  • Compare the recovered haplotypes to the input haplotypes with BLAST
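
To make the read-generation step in the list above concrete, here is a minimal sketch of uniform read shredding. It is an illustration under stated assumptions (fixed read length, uniformly drawn start positions, a target average per-haplotype coverage), not the actual generator used for the paper:

    import random

    def shred(hap_name, haplotype, read_len=150, coverage=7, seed=None):
        # Cut enough fixed-length reads from the haplotype sequence that the
        # average per-base coverage is roughly the requested value.
        rng = random.Random(seed)
        n_reads = (len(haplotype) * coverage) // read_len
        reads = []
        for i in range(n_reads):
            start = rng.randint(0, len(haplotype) - read_len)
            # Encode the source haplotype and position in the read name so the
            # read's origin can be recovered after alignment.
            name = "%s_%d_%d" % (hap_name, i, start)
            reads.append((name, haplotype[start:start + read_len]))
        return reads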

As described in our paper, results for both the DHFR and AIMP1 data sets were promising, with up to 99% of SNPs correctly recovered to successfully reconstruct the input haplotypes. We have shown that haplotypes can be recovered from short reads that originate from a metagenomic data set. However, I want to revisit this analysis to improve the story and fix a few outstanding issues:

  • A reviewer assumed that lower recovery rates on haplotypes with poorer identity to the pseudo-reference were purely down to bad decision making by Gretel, despite the long paragraph about how this is caused by discarded reads
  • A reviewer described the tabular results as hard to follow
  • I felt I had not provided readers with a deep enough analysis of the effects of read length and coverage; although no reviewer asked for this, it will help me sleep at night
  • Along the same theme, the tables in the pre-print only provided averages, whereas I felt a diagram might better explain some of the variance observed in recovery rates
  • Additionally, I’d updated the read generator as part of my work on the “Treeviomes” discussed in my last post, particularly to generate more realistic looking distributions of reads; I needed to check results were still reliable, and thought it would be helpful if we had a consistent story between our data sets

Progress

Continuing the rampant pessimism that can be found in the first half of this status report, I wanted to audit my work on these “synthetic metahaplomes”, particularly given the changed results between the Triviomes, and more biologically realistic and tangible Treeviomes. I wanted to be sure that we had not departed from the good results introduced in the DHFR and AIMP1 sections of the paper.

To be a little less gloomy, I was also inspired by the progress I had made on the analysis of the Treeviomes. I felt we could try to present our results uniformly, using box diagrams similar to those that I had finally managed to design and plot at the end of my last post. I feel these diagrams provide a much more detailed insight into the capabilities of Gretel given attributes of read sets such as length and coverage. Additionally, we know each of the input haplotypes, which unlike the uniformly varying sequences of our Treeviomes, have decreasing similarity to the pseudo-reference; thus we can use such a plot to describe Gretel’s accuracy with respect to each haplotype’s distance from the pseudo-reference (somewhat akin to how we present results for different values of per-haplotype per-base sequence mutation rates on our Treeviomes).

So here’s one I made earlier:

Sam, what the fuck am I looking at?

  • Vertical facets are read length
  • Horizontal facets represent each of the five input haplotypes, ordered by decreasing similarity to the chosen DHFR pseudo-reference
  • X-axis of each boxplot is the average read coverage for each of the ten generated read sets
  • Y-axis is the average best recovery rate for a given haplotype, over the ten runs of Gretel: each boxplot summarising the best recoveries for ten randomly generated, uniformly distributed sets of reads (with a set read length and coverage)
  • Top scatter plot shows number of called variants across the ten sets of reads for each length-coverage parameter pair
  • Bottom scatter plot shows for each of the ten read sets, the proportion of reads from each haplotype that were dropped during alignment

What can we see?

  • Everything is not awful: three of the five haplotypes can be recovered to around 95% accuracy, even with very short reads (75-100bp), and reasonable average per-haplotype coverage.
  • Good recovery of AK232978, despite a 10% mutation rate when compared to the pseudo-reference: Gretel yielded haplotypes with 80% accuracy.
  • Recovery of XM_012113510 is particularly difficult. Even given long reads (150-250bp) and high coverage (50x) we fail to reach 75% accuracy. Our pre-print hypothesised that this was due to its 82.9% identity to the pseudo-reference causing dropped reads.

Our pre-print presents results considering just one read set for each of the length and coverage parameter pairs. Despite the need to perhaps take those results with a pinch of salt, it is somewhat relieving to see that our new more robust experiment yields very similar recovery rates across each of the haplotypes. With the exception of XM_012113510.

The Fall of “XM”

Recovery rates for our synthetic metahaplomes, such as DHFR, are found by BLASTing the haplotypes returned by Gretel against the five known, input haplotypes. The “best haplotype” for an input is defined as the output haplotype with the fewest mismatches against the given input. We report the recovery rate by assuming those mismatches are the incorrectly recovered variants, and dividing the number of correctly recovered variants by the number of variants called over the read set that yielded this haplotype.

We cannot simply use sequence identity as a score: consider a sequence of 100bp with just 5 variants, all of which are incorrectly reconstructed; Gretel will still appear to be 95% accurate, rather than correctly reporting 0% (1 – 5/5). Note however, that our method does somewhat assume that variants are called perfectly: remember that homogeneous sites are ignored by Gretel, so any “missed” SNPs will always mismatch the input haplotypes, as we “fill in” homogeneous sites in the output haplotype FASTA with the nucleotides from the pseudo-reference.
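
As a minimal illustration of that scoring (not the actual evaluation harness), the 100bp example above works out like this:

    def recovery_rate(mismatches, n_called_variants):
        # Mismatches against the best-hitting input haplotype are assumed to be
        # incorrectly recovered variants, scored against the number of variants
        # called on the read set that yielded this haplotype.
        return 1.0 - (float(mismatches) / n_called_variants)

    # 100bp sequence, 5 called variants, all recovered incorrectly: naive
    # identity would still report 95%, but the recovery rate is correctly 0%.
    assert recovery_rate(5, 5) == 0.0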

XM_012113510 posed an interesting problem for evaluation. When collating results for the pre-print, I found the corresponding best haplotype failed to align the entirety of the gene (564bp) and instead yielded two high scoring, non-overlapping hits. When I updated the harness that calculates and collates recovery rates across all the different runs of Gretel, I overlooked the possibility for this and forced selection of the “best” hit (-max_hsps 1). Apparently, this split-hit behaviour was not a one-off, but almost universal across all runs of Gretel over the 640 sets of reads.

In fact, the split-hit was highly precise. Almost every best XM_012113510 haplotype had a hit along the first 87bp, and another starting at (or very close to) 135bp, stretching across the remainder of the gene. My curiosity piqued, I fell down another rabbit hole.

Curious Coverage

I immediately assumed this must be an interesting side-effect of read dropping. I indicated in our pre-print that we had difficulty aligning reads against a pseudo-reference where those reads originate from a sequence that is dissimilar (<= 90% identity) to the reference. This read dropping phenomenon has been one of the many issues our lab has experienced with current bioinformatics tooling when dealing with metagenomic data sets.

Indeed, our boxplot above features a colour-coded scatter plot that demonstrates that the two most dissimilar input haplotypes, AK and XM, are also the haplotypes that experience the most trouble during read alignment. I suggested in the pre-print that these dropped reads are the underlying reason for Gretel’s performance on those more dissimilar haplotypes.

I wanted to see empirically whether this was the case, and to potentially find a nice way of providing evidence, given that one of our reviewers missed my attribution of these poorer AK and XM recoveries to trouble with our alignment step, rather than incompetence on Gretel’s part. I knocked up a small Python script that deciphered my automatically generated read names, and calculated per-haplotype read coverage, for each of the five input haplotypes, across the 640 alignments.
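
The script itself is nothing special; a minimal sketch of the idea (the read name format here is a hypothetical stand-in for my generated names, and the real script differs):

    import pysam
    from collections import defaultdict

    def per_haplotype_coverage(bam_path, ref_length):
        # haplotype name -> per-position count of aligned reads covering it
        cov = defaultdict(lambda: [0] * ref_length)
        with pysam.AlignmentFile(bam_path, "rb") as bam:
            for read in bam.fetch():
                if read.is_unmapped:
                    continue
                # Recover the source haplotype from the generated read name
                hap = read.query_name.split("_")[0]
                for pos in read.get_reference_positions():
                    cov[hap][pos] += 1
        return cov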

  • X-axis represents genomic position along the pseudo-reference (the “master” DHFR gene)
  • Y-axis labels each of the input haplotypes, starting with the most dissimilar (XM) and ending on the most similar (BC)
  • The coloured heatmap plots the average coverage for a given haplotype at a given position, across all generated BAM files

Well look what we have here… Patches of XM and AK are most definitely poorly supported by reads! This is troubling for their recovery, as Hansel stores pairwise evidence of variants co-occurring on the same read. If those reads are discarded, we have no evidence of that haplotype. It should be no surprise that Gretel has difficulty here. We can’t create evidence from nothing.

What’s this?… A crisp, ungradiated, dark navy box representing absolutely 0 reads across any of the 640 alignments, sitting on our XM gene. If I had to guesstimate the bounds of that box on the X-axis, I would bet my stipend that they’ll be a stone’s throw from 87 and 135bp… the bounds of the split BLAST hits we were reliably generating against XM_012113510, almost 640 times. I delved deeper. What did the XM reads over that region look like? Why didn’t they align correctly? Were they misaligned, or dropped entirely?

But I couldn’t find any.

Having a blast with BLAST

After much head scratching, I began an audit from the beginning, consulting the BLAST record that led me to select XM_012113510.1 for our study:

EU145592.1   XM_012113510.1   82.979   564   46   1   1   564   52   565   3.89e-175   625

Seems legit? An approximately 83% hit with full query coverage (564bp), consisting of 46 mismatches and containing an indel of length one? Not quite, as I would later find. But to cut an already long story somewhat shorter: the problem boils down to me being unable to read, again.

  • BLAST outfmt6 alignment lengths include deletions on the subject
  • BLAST outfmt6 tells you the number of gap openings, not their total size
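
With hindsight, the indel is detectable from the tabular record alone; a minimal sketch, assuming the default outfmt6 column order (qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore):

    def hidden_indel(outfmt6_line):
        # The alignment length counts gap columns and "gapopen" only counts
        # openings, so an indel's size must be inferred from the difference
        # between the query span and the subject span.
        f = outfmt6_line.split()
        qstart, qend, sstart, send = int(f[6]), int(f[7]), int(f[8]), int(f[9])
        q_span = abs(qend - qstart) + 1
        s_span = abs(send - sstart) + 1
        return q_span - s_span  # positive: deletion on the subject

    # The record above: the query spans 564bp but the subject spans only 514bp,
    # i.e. a 50bp deletion hiding behind a single gap opening.
    record = "EU145592.1 XM_012113510.1 82.979 564 46 1 1 564 52 565 3.89e-175 625"
    assert hidden_indel(record) == 50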

For fuck’s sake, Sam. Sure enough, here’s the full alignment record. Complete with a fucking gigantic 50bp deletion on the subject, XM_012113510.

Query  1    ATGGTTGGTTCGCTAAACTGCATCGTCGCTGTGTCCCAGAACATGGGCATCGGCAAGAAC  60
            |||||| || ||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct  52   ATGGTTCGTCCGCTAAACTGCATCGTCGCTGTGTCCCAGAACATGGGCATCGGCAAGAAC  111

Query  61   GGGGACCTGCCCTGGCCACCGCTCAGGAATGAATTCAGATATTTCCAGAGAATGACCACA  120
            ||| |||||||||||||||| |||||                                  
Sbjct  112  GGGAACCTGCCCTGGCCACCACTCAG----------------------------------  137

Query  121  ACCTCTTCAGTAGAAGGTAAACAGAATCTGGTGATTATGGGTAAGAAGACCTGGTTCTCC  180
                            ||||||||||| ||||||||||||||| ||||||||||||||||
Sbjct  138  ----------------GTAAACAGAATTTGGTGATTATGGGTAGGAAGACCTGGTTCTCC  181

Query  181  ATTCCTGAGAAGAATCGACCTTTAAAGGGTAGAATTAATTTAGTTCTCAGCAGAGAACTC  240
            ||||| ||||||||||||||||||||||  ||||||||| |||||||||| |||||||||
Sbjct  182  ATTCCAGAGAAGAATCGACCTTTAAAGGACAGAATTAATATAGTTCTCAGTAGAGAACTC  241

Query  241  AAGGAACCTCCACAAGGAGCTCATTTTCTTTCCAGAAGTCTAGATGATGCCTTAAAACTT  300
            |||||||||||| | ||||||||||||||| ||| |||||| |||||||||||| |||||
Sbjct  242  AAGGAACCTCCAAAGGGAGCTCATTTTCTTGCCAAAAGTCTGGATGATGCCTTAGAACTT  301

Query  301  ACTGAACAACCAGAATTAGCAAATAAAGTAGACATGGTCTGGATAGTTGGTGGCAGTTCT  360
            | |||| | ||||||||| ||||||||||||||||||| |||||||| || |||||||||
Sbjct  302  ATTGAAGATCCAGAATTAACAAATAAAGTAGACATGGTTTGGATAGTGGGAGGCAGTTCT  361

Query  361  GTTTATAAGGAAGCCATGAATCACCCAGGCCATCTTAAACTATTTGTGACAAGGATCATG  420
            || |||||||||||||||||  | ||||||||||||| ||||||||||||||||||||||
Sbjct  362  GTATATAAGGAAGCCATGAACAAGCCAGGCCATCTTAGACTATTTGTGACAAGGATCATG  421

Query  421  CAAGACTTTGAAAGTGACACGTTTTTTCCAGAAATTGATTTGGAGAAATATAAACTTCTG  480
            ||||| |||||||||||   |||||| |||||||||||||| || |||||||||||||| 
Sbjct  422  CAAGAATTTGAAAGTGATGTGTTTTTCCCAGAAATTGATTTTGAAAAATATAAACTTCTT  481

Query  481  CCAGAATACCCAGGTGTTCTCTCTGATGTCCAGGAGGAGAAAGGCATTAAGTACAAATTT  540
            |||||||| ||||||||||  |  |||||||||||||| |||||||||||||||||||||
Sbjct  482  CCAGAATATCCAGGTGTTCCTTTGGATGTCCAGGAGGAAAAAGGCATTAAGTACAAATTT  541

Query  541  GAAGTATATGAGAAGAATGATTAA  564
            ||||||||||| |||||  |||||
Sbjct  542  GAAGTATATGAAAAGAACAATTAA  565

It is no wonder we couldn’t recover 100% of the XM_012113510 haplotype: compared to our pseudo-reference, 10% of it doesn’t fucking exist. Yet, it is interesting to see that the best recovered XM_012113510s were identified by BLAST to be very good hits to the XM_012113510 that actually exists in nature, despite that spurious 50bp of sequence. Although Gretel is still biased by the structure of the pseudo-reference (which is one of the reasons that insertions and deletions are still a pain), we are still able to make accurate recoveries around straightforward indel sites like this.

As lovely as it is to have gotten to the bottom of the XM_012113510 recovery troubles, this doesn’t really help our manuscript. We want to avoid complicated discussions and workarounds that may confuse or bore the reader. We already had trouble with our reviewers misunderstanding the purpose of the convoluted Triviome method and its evaluation. I don’t want to have to explain why recovery rates for XM_012113510 need special consideration because of this large indel.

I decided the best option was to find a new member of the band.

DHFR: Reunion Tour

As is the case whenever I am certain something won’t take the entire day, this was not as easy as I anticipated. Whilst it was trivial to BLAST my chosen pseudo-reference once more and find a suitable replacement (reading beyond “query cover” alone this time), trouble arose once more in the form of reads dropped via alignment.

Determined that these sequences should (and must) align, I dedicated the day to finally finding some bowtie2 parameters that would more permissively align sequences to my pseudo-reference, to ensure Gretel had the necessary evidence to recover even the more dissimilar sequences:

Success.

So, what’s next?

Wonderful. So after your productive day of heatmap generation, what are you going to do with your new-found alignment powers?

✓ Select a new DHFR line-up

The door has been blown wide open here. Previously our input test haplotypes had a lower bound of around 90% sequence similarity to avoid the loss of too many reads. However, with the parameter set that I have termed --super-sensitive-local, we can attempt recoveries of genes that are even more dissimilar from the reference! I’ve actually already made the selection, electing to keep BC (99.8%), XR (97.3%) and AK (90.2%). I’ve removed KJ (93.2%) and the troublemaking XM (85.1%) to make way for the excitingly dissimilar M (83.5%) and (a different) XM (78.7%).

✓ Make a pretty picture

After some chitin and Sun Grid Engine wrangling, I present this:

✓ Stare in disbelief

Holy crap. It works.


Conclusion

  • Find what is inevitably wrong with these surprisingly excellent results
  • Get this fucking paper done
  • Merry Christmas

tl;dr

  • In a panic I went back to check my DHFR analysis and then despite it being OK, changed everything anyway
  • One of the genes from our paper is quite hard to recover, because 10% of it is fucking missing
  • Indels continue to be a pain in my arse
  • Contrary to what I said last week, not everything is awful
  • Coverage really is integral to Gretel’s ability to make good recoveries
  • I’ve constructed some parameters that allow us to permissively align metagenomic reads to our pseudo-reference and shown it to massively improve Gretel’s accuracy
  • bowtie2 needs some love <3
  • I probably care far too much about integrity of data and results and should just write my paper already
  • A lot of my problems boil down to not being able to read
  • I owe Hadley Wickham a drink for making it so easy to make my graphs look so pretty
  • Everyone agrees this should be my last Christmas as a PhD
  • Tom is mad because I keep telling him about bugs and pitfalls I have found in bioinformatics tools
bowtie2: Relaxed Parameters for Generous Alignments to Metagenomes
https://samnicholls.net/2016/12/24/bowtie2-metagenomes/
Sat, 24 Dec 2016 00:34:46 +0000

In a change to my usual essay length posts, I wanted to share a quick bowtie2 tip for relaxing the parameters of alignment. It’s no big secret that bowtie2 has these options, and there’s some pretty good guidance in the manual, too. However, we’ve had significant trouble in our lab finding a suitable set of permissive alignment parameters.

In the course of my PhD work on haplotyping regions of metagenomes, I have found that even with bowtie2’s somewhat permissive --very-sensitive-local, sequences with an identity to the reference of less than 90% are significantly less likely to align back to that reference. This is problematic in my line of work, where I wish to recover all of the individual variants of a gene: the basis of my approach relies on a set of short reads (50-250bp) aligned to a position on a metagenomic assembly (that I term the pseudo-reference). It’s important to note that I am not interested in the assembly of individual genomes from metagenomic reads, but in the genes themselves.

Recently, the opportunity arose to provide some evidence to this. I have some datasets which constitute “synthetic metahaplomes” that consist of a handful of arbitrary known genes that all perform the same function, each from a different organism. These genes can be broken up into synthetic reads and aligned to some common reference (another gene in the same family).

This alignment can be used as a means to test my metagenomic haplotyper, Gretel (and her novel brother data structure, Hansel), by attempting to recover the original input sequences from these synthetic reads. I’ve already reported in my pre-print that our method is at the mercy of the preceding alignment, and used this as the hypothesis for a poor recovery in one of our data sets.

Indeed as part of my latest experiments, I have generated some coverage heat maps, showing the average coverage of each haplotype (Y-axis) at each position of the pseudo-reference (X-axis) and I’ve found that for sequences beyond the vicinity of 90% sequence identity, --very-sensitive-local becomes unsuitable.

The BLAST record below represents the alignment that corresponds to the gene whose reads go on to align at the average coverage depicted at the top bar of the above heatmap. Despite its 79% identity, it looks good(TM) to me, and I need sequence of this level of diversity to align to my pseudo-reference so it can be included in Gretel’s analysis. I need generous alignment parameters to permit even quite diverse reads (but hopefully not too diverse such that it is no longer a gene of the same family) to map back to my reference. Otherwise Gretel will simply miss these haplotypes.

So despite having already spent many days of my PhD repeatedly failing to increase my overall alignment rates for my metagenomes, I felt this time it would be different. I had a method (my heatmap) to see how my alignment parameters affected the alignment rates of reads on a per-haplotype basis. It’s also taken until now for me to quantify just what sort of sequences we are missing out on, courtesy of dropped reads.

I was determined to get this right.

For a change, I’ll save you the anticipation and tell you what I settled on after about 36 hours of getting cross.

  • --local -D 20 -R 3
    Ensure we’re not performing end-to-end alignment (allow for soft clipping and the like), and borrow the most sensitive default “effort” parameters.
  • -L 3
    The seed substring length. Decreasing this from the default (20 - 25) to just 3 allows for a much more aggressive alignment, but adds computational cost. I actually had reasonably good results with -L 11, which might suit you if you have a much larger data set but still need to relax the aligner.
  • -N 1
    Permit a mismatch in the seed, because why not?
  • --gbar 1
    Has a small, but noticeable effect. Appears to thin the width of some of the coverage gap in the heatmap at the most stubborn sites.
  • --mp 4
    Reduces the maximum penalty that can be applied to a strongly supported (high quality) mismatch by a third (from the default value of 6). The aggregate sum of these penalties is responsible for the dropping of reads. Along with the substring length, this had a significant influence on increasing my alignment rates. If your coverage stains are stubborn, you could decrease this again.
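
Each tweak was judged by how many reads from each haplotype actually survive into the BAM; a minimal sketch of that check, assuming (as in my synthetic read sets) that read names encode their source haplotype:

    import pysam
    from collections import Counter

    def per_haplotype_alignment_rate(bam_path, reads_generated):
        # reads_generated: haplotype name -> number of reads shredded from it
        aligned = Counter()
        with pysam.AlignmentFile(bam_path, "rb") as bam:
            for read in bam.fetch():
                aligned[read.query_name.split("_")[0]] += 1
        return {hap: aligned[hap] / float(total)
                for hap, total in reads_generated.items()}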

Tada.


tl;dr

  • bowtie2 --local -D 20 -R 3 -L 3 -N 1 -p 8 --gbar 1 --mp 3
Status Report: November 2016 (Part I): Triviomes, Treeviomes & Fuck Everything
https://samnicholls.net/2016/12/19/status-nov16-p1/
Mon, 19 Dec 2016 23:14:33 +0000

Once again, I have adequately confounded progress since my last report to both myself, and my supervisorial team such that it must be outlaid here. Since I’ve got back from having a lovely time away from bioinformatics, the focus has been to build on top of our highly shared but unfortunately rejected pre-print: Advances in the recovery of haplotypes from the metagenome.

I’d hoped to have a new-and-improved draft ready by Christmas, in time for an invited talk at Oxford, but sadly I’ve had to postpone both. Admittedly, it has taken quite some time for me to dust myself down after having the entire premise of my PhD so far rejected without re-submission, but I have finally built up the motivation to revisit what is quite a mammoth piece of work, and am hopeful that I can take some of the feedback on board to ring in the new year with an even better paper.

This will likely be the final update of the year.
This is also the last Christmas I hope to be a PhD candidate.

Friends and family can skip to the tldr

The adventure continues…

We left off with a lengthy introduction to my novel data structure, Hansel, and algorithm, Gretel. In that post I briefly described some of the core concepts of my approach, such as how the Hansel matrix is reweighted after Gretel successfully creates a path (haplotype), how we automatically select a suitable value for the “lookback” parameter (i.e. the order of the Markov chain used when calculating probabilities for the next variant of a haplotype), and the current strategy for smoothing.

In particular, I described our current testing methodologies. In the absence of metagenomic data sets with known haplotypes, I improvised two strategies:

  • Trivial Haplomes (Triviomes)
    Data sets designed to be finely controlled, and well-defined. Short, random haplotypes and sets of reads are generated. We also generate the alignment and variant calls automatically to eliminate noise arising from the biases of external tools. These data sets are not expected to be indicative of performance on actual sequence data, but rather represent a platform on which we can test some of the limitations of the approach.

  • Synthetic Metahaplomes
    Designed to be more representative of the problem, we generate synthetic reads from a set of similar genes. The goal is to recover the known input genes, from an alignment of their reads against a pseudo-reference.

I felt our reviewers misunderstood both the purpose and results of the “triviomes”. In retrospect, this was probably due to the (albeit intentional) lack of any biological grounding distracting readers from the story at hand. The trivial haplotypes were randomly generated, such that none of them had any shared phylogeny. Every position across those haplotypes was deemed a SNP, and many were tetra-allelic. The idea behind this was to cut out the intermediate stage of needing to remove homogeneous positions across the haplotypes (or in fact, from even having to generate haplotypes that had homogeneous positions). Generated reads were thus seemingly unrealistic, at a length of 3-5bp. However, they were meant to represent not a 3-5bp piece of sequence, but the 3-5bp sequence that remains when one only considers genomic positions with variation, i.e. our reads were simulated such that they spanned between 3 and 5 SNPs of our generated haplotypes.

I believe these confusing properties and their justifications got in the way of expressing their purpose, which was not to emulate the real metahaplotying problem, but to introduce some of the concepts and limitations of our approach in a controlled environment.

Additionally, our reviewers argued that the paper is lacking an extension to the evaluation of synthetic metahaplomes: data sets that contain real sequencing reads. Indeed, I felt that this was probably the largest weakness of my own paper, especially as it would not require an annotated metagenome. Though, I had purposefully stayed on the periphery of simulating a “proper” metagenome, as there are ongoing arguments in the literature as to the correct methodology and I wanted to avoid the simulation itself being used against our work. That said, it would be prudent to at least present small synthetic metahaplomes akin to the DHFR and AIMP1, using real reads.

So this leaves us with a few major plot points to work on before I can peddle the paper elsewhere:

  • Improve Triviomes
    We are already doing something interesting and novel, but the “triviomes” are evidently convoluting the explanation. We need something with more biological grounding such that we don’t need to spend many paragraphs explaining why we’ve made certain simplifications, or cause readers to question why we are doing things in a particular way. Note this new method will still need to give us a controlled environment to test the limitations of Hansel and Gretel.
  • Polish DHFR and AIMP1 analysis
    One of our reviewers misinterpreted some of the results, and drew a negative conclusion about Gretel’s overall accuracy. I’d like to revisit the DHFR and AIMP1 data sets both to improve the story we tell, and to describe in more detail (with more experiments) under what conditions we can and cannot recover haplotypes accurately.
  • Real Reads
    Create and analyse a data set consisting of real reads.

The remainder of this post will focus on the first point, because otherwise no-one will read it.


Triviomes and Treeviomes

After a discussion about how my Triviomes did not pay off (where I believe I likened them to “random garbage”), it was clear that we needed a different tactic to introduce this work. Ideally this would be something simple enough that we could still have total control over both the metahaplome to be recovered, and the reads to recover it from, but also yield a simpler explanation for our readers.

My biology-sided supervisor, Chris, is an evolutionary biologist with a fetish for trees. Throughout my PhD so far, I have managed to steer away from phylogenetic trees and the like, especially after my terrifying first year foray into taxonomy, where I discovered that not only can nobody agree on what anything is, or where it should go, but there are many ways to skin a cat, I mean, draw a tree.

Previously, I presented the aggregated recovery rates of randomly generated metahaplomes, for a series of experiments, where I varied the number of haplotypes, and their length. Remember that every position of these generated haplotypes was a variant. Thus, one may argue that the length of these random haplotypes was a poor proxy for genetic diversity. That is, we increased the number of variants (by making longer haplotypes) to artificially increase the level of diversity in the random metahaplome, and make recoveries more difficult. Chris pointed out that actually, we could specify and fix the level of diversity, and generate our haplotypes according to some… tree.

This seemed like an annoyingly neat and tidy solution to my problem. Biologically speaking, this is a much easier explanation for readers; our sequences will have meaning, our reads will look somewhat more realistic and most importantly, the recovery goal is all the more tangible. Yet at the same time, we still have precise control over the tree, and we can generate the synthetic reads in exactly the same way as before, allowing us to maintain tight control of their attributes. So, despite my aversion to anything that remotely resembles a dendrogram, on this occasion, I have yielded. I introduce the evaluation strategy to supplant[1] my Triviomes: Treeviomes.

(Brief) Methodology

  • Heartlessly throw the Triviomes section in the bin
  • Generate a random start DNA sequence
  • Generate a Newick format tree. The tree is a representation of the metahaplome that we will attempt to recover. Each branch (taxa) of the tree corresponds to a haplotype. The shape of the tree will be a star, with each branch of uniform length. Thus, the tree depicts a number of equally diverse taxa from a shared origin
  • Use the tree to simulate evolution of the start DNA sequence to create the haplotypes that comprise the synthetic metahaplome
  • As before, generate reads (of a given length, at some level of coverage) from each haplotype, and automatically generate the alignment (we know where our generated reads should start and end on the reference without external tools) and variant calls (any heterogeneous genomic position when the reads are piled up)
  • Rinse and repeat, make pretty pictures

The foundation for this part of the work is set. Chris even recommended seq-gen as a tool that can simulate evolution from a starting DNA sequence, following a Newick tree, which I am using to generate our haplotypes. So I now have a push-button-to-metahaplome workflow that generates the necessary tree, haplotypes, and reads for testing Gretel.
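
For the tree step, a minimal sketch of the kind of helper involved (a hypothetical illustration, not the actual workflow code): build a star-shaped Newick tree of n taxa with uniform branch lengths, ready to hand to seq-gen:

    def star_newick(n_taxa, branch_length):
        # Taxa are labelled A, B, C, ... and every branch gets the same length,
        # i.e. equally diverse haplotypes radiating from a shared origin.
        taxa = ",".join("%s:%s" % (chr(ord("A") + i), branch_length)
                        for i in range(n_taxa))
        return "(%s);" % taxa

    print(star_newick(3, 0.01))  # (A:0.01,B:0.01,C:0.01);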

I’ve had two main difficulties with Treeviomes…

• Throughput

Once again, running anything thousands of times has proven the bane of my life. Despite having a well defined workflow to generate and test a metahaplome, getting the various tools and scripts to work on the cluster here has been a complete pain in my arse. So much so, I ended up generating all of the data on my laptop (sequentially, over the course of a few days) and merely uploading the final BAMs and VCFs to our compute cluster to run Gretel. This has been pretty frustrating, especially when last weekend I set my laptop to work on creating a few thousand synthetic metahaplomes and promised some friends that I’d take the weekend off work for a change, only to find on Monday that my laptop had done exactly the same.

• Analysis

Rather unexpectedly, initial results raised more questions than answers. This was pretty unwelcome news following the faff involved in just generating and testing the many metahaplomes. Once Gretel‘s recoveries were finished (the smoothest part of the operation, which was a surprise in itself, given the presence of Sun Grid Engine), another disgusting munging script of my own doing spat out the convoluted plot below:

The figure is a matrix of boxplots where:

  • Horizontal facets are the number of taxa in the tree (i.e. haplotypes)
  • Vertical facets are per-haplotype, per-base mutation rates (i.e. the probability that any genomic position on any of the taxa may be mutated from the common origin sequence)
  • X-axis of each boxplot represents each haplotype in the metahaplome, labelled A – O
  • Y-axis of each boxplot quantifies the average best recovery rate made by Gretel for a given haplotype A – O, over ten executions of Gretel (each using a different randomly generated, uniformly distributed read set of 150bp at 7x per-haplotype coverage)

We could make a few wild speculations, but no concrete conclusions:

  • At low diversity, it may be impossible to recover haplotypes, especially for metahaplomes containing fewer haplotypes
  • Increasing diversity appears to create more variance in accuracy; mean accuracy increases slightly in datasets with 3-5 haplotypes, but falls with 10+
  • Increasing the number of haplotypes in the metahaplome appears to increase recovery accuracy
  • In general, whilst there is variation, recovery rates across haplotypes are fairly clustered
  • It is possible to achieve 100% accuracy for some haplotypes under high diversity, and few true haplotypes

The data is not substantial on the surface. But, if anything, I had seemed to refute my own pre-print. Counter-intuitively, we now seem to have shown that the problem is easier in the presence of more haplotypes, and more variation. I was particularly disappointed with the ~80% accuracy rates at mid-level diversity on just 3 haplotypes. Overall, recovery accuracy appeared worse than that of my less realistic Triviomes.

This made me sad, but mostly cross.

The beginning of the end of my sanity

I despaired at the apparent loss of accuracy. Where had my over 90% recoveries gone? I could feel my PhD pouring away through my fingers like sand. What changed here? Indeed, I had altered the way I generated reads since the pre-print; was it the new read shredder? Or are we just less good at recovering from more realistic metahaplomes? With the astute assumption that everything I am working on equates to garbage, I decided to miserably withdraw from my PhD for a few days to play Eve Online…

I enjoyed my experiences of space. I began to wonder whether I should quit my PhD and become an astronaut, shortly before my multi-million ISK ship was obliterated by pirates. I lamented my inability to enjoy games that lack copious micromanagement, before accepting that I am destined to be grumpy in all universes and that perhaps for now I should be grumpy in the one where I have a PhD to write.

In retrospect, I figure that perhaps the results in my pre-print and the ones in our new megaboxplot were not in disagreement, but rather incomparable in the first place. Whilst an inconclusive conclusion on that front would not answer any of the other questions introduced by the boxplots, it would at least make me feel a bit better.

Scattering recovery rates by variant count

So I constructed a scatter plot to show the relationship between the number of called variants (i.e. SNPs), and best Gretel recovery rate for each haplotype of all of the tested metahaplomes (dots coloured by coverage level below), against the overall best average recovery rates from my pre-print (large black dots).

Immediately, it is obvious that we are discussing a difference in magnitude when it comes to numbers of called variants, particularly when base mutation rates are high. But if we are still looking for excuses, we can consider the additional caveats:

  • Read coverage from the paper is 3-5x per haplotype, whereas our new data set uses a fixed coverage of 7x
  • The number of variants on the original data sets (black dots) are guaranteed, and bounded, by their length (250bp max)
  • Haplotypes from the paper were generated randomly, with equal probabilities for nucleotide selection. We can consider this as a 3 in 4 chance of disagreeing with the pseudo-reference (a 0.75 base mutation rate). The most equivalent subset of our new data consists of metahaplomes with a base mutation rate of “just” 0.25.

Perhaps the most pertinent point here is the last. Without an insane 0.75 mutation rate dataset, it really is quite sketchy to debate how recovery rates of these two data sets should be compared. This said, from the graph we can see:

  • Those 90+% average recoveries I’m missing so badly belong to a very small subset of the original data, with very few SNPs (10-25)
  • There are still recovery rates stretching toward 100%, particularly for the 3 haplotype data set, but for base mutation of 2.5% and above
  • Actually, recovery rates are not so sad overall, considering the significant number of SNPs, particularly for the 5 and 10 haplotype metahaplomes

Recoveries are high for unrealistic variation

Given that a variation rate of 0.75 is incomparable, what’s a sensible amount of variation to concern ourselves with anyway? I ran the numbers on my DHFR and AIMP1 data sets; dividing the number of called variants on my contigs by their total length. Naively distributing the number of SNPs across each haplotype evenly, I found the magic number representing per-haplotype, per-base variation to be around 1.5% (0.015). Of course, that isn’t exactly a rigorous analysis, but perhaps points us in the right direction, if not the correct order of magnitude.
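
The back-of-envelope calculation, with hypothetical numbers rather than the actual DHFR and AIMP1 counts, is just the overall variant density shared out evenly between the haplotypes assumed to be present:

    def per_haplotype_variation(n_called_variants, contig_length, n_haplotypes):
        # Overall variant density, naively split across the haplotypes.
        return n_called_variants / float(contig_length) / n_haplotypes

    # e.g. 45 called SNPs on a 600bp contig believed to hold 5 haplotypes
    print(per_haplotype_variation(45, 600, 5))  # 0.015, i.e. the "magic" 1.5%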

So the jig is up? We report high recovery rates for unnecessarily high variation rates (>2.5%), but our current data sets don’t seem to support the idea that Gretel needs to be capable of recovering from metahaplomes demonstrating that much variation. This is bad news, as conversely, both our megaboxplot and scatter plot show that for rates of 0.5%, Gretel recoveries were not possible in either of the 3 or 5 taxa metahaplomes. Additionally at a level of 1% (0.01), recovery success was mixed in our 3 taxa datasets. Even at the magic 1.5%, for both the 3 and 5 taxa, average recoveries sit uninterestingly between 75% and 87.5%.

Confounding variables are the true source of misery

Even with the feeling that my PhD is going through rapid unplanned disassembly with me still inside of it, I cannot shake off the curious result that increasing the number of taxa in the tree appears to improve recovery accuracy. Each faceted column of the megaboxplot shares elements of the same tree. That is, the 3 taxa 0.1 (or 1%) diversity rate tree, is a subtree of the 15 taxa 0.1 diversity tree. The haplotypes A, B and C, are shared. Yet why does the only reliable way to improve results among those haplotypes seem to be the addition of more haplotypes? In fact, why are the recovery rates of all the 10+ metahaplomes so good, even under per-base variation of half a percent?

We’ve found the trap door, and it is confounding.

Look again at the pretty scatter plot. Notice how the number of called variants increases as we increase the number of haplotypes, for the same level of variation. Notice too that the same A, B and C haplotypes that make up the 3-taxa trees can actually be recovered at low diversity, when there are 10 or 15 taxa present.

Recall that each branch of our tree is weighted by the same diversity rate. Thus, when aligned to a pseudo-reference, synthetic reads generated from metahaplomes with more original haplotypes have a much higher per-position probability of containing at least one disagreeing nucleotide in a pileup. That is, the number of variants is a function of the number of original haplotypes, not just their diversity.

The confounding factor is the influence of Gretel’s lookback parameter: L. We automatically set the order of the Markov chain used to determine the next nucleotide variant given the last L selected variants, to be equal to the average number of SNPs spanned by all valid reads that populated the Hansel structure. A higher number of called variants in a dataset offers not only more pairwise evidence for Hansel and Gretel to consider (as there are more pairs of SNPs), but also a higher order Markov chain (as there are more pairs of SNPs on the same read). Thus, with more SNPs, the hypothesis is that Gretel has at her disposal runs of L variants that are not only longer (as L itself grows), but more unique to the haplotype that must be recovered.
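
To make the confounder explicit, here is an illustration of the rule as described (not Gretel’s actual implementation): L is the average number of called variants spanned by the reads, so packing more SNPs onto reads of the same length raises the order of the Markov chain:

    import bisect

    def auto_lookback(read_intervals, snp_positions):
        # read_intervals: (start, end) reference coordinates, half-open
        # snp_positions: called variant positions on the pseudo-reference
        snps = sorted(snp_positions)
        spans = []
        for start, end in read_intervals:
            n = bisect.bisect_left(snps, end) - bisect.bisect_left(snps, start)
            if n > 0:  # only reads covering at least one variant contribute
                spans.append(n)
        return float(sum(spans)) / len(spans) if spans else 0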

It seems my counter-intuitive result of more variants and more haplotypes making the problem easier, has the potential to be true.

This theory explains the converse problem of being unable to recover any haplotypes from 3 and 5-taxa trees at low diversity. There simply aren’t enough variants to inform Gretel. After all, at a rate of 0.5%, one would expect a mere 5 variants per 1000bp. Our scatter plot shows for our 3000bp pseudo-reference, at the 0.5% level we observe fewer than 50 SNPs total, across the haplotypes of our 3-taxa tree. Our 150bp reads are not long enough to span the gaps between variants, and Gretel cannot make decisions on how to cross these gaps.

This doesn’t necessarily mean everything is not terrible, but it certainly means the megaboxplot is not only an awful way to demonstrate our results, but probably a poorly designed experiment too. We currently confound the average number of SNPs on reads by observing just the number of haplotypes, and their diversity. To add insult to statistical injury, we then plot them in facets that imply they can be fairly compared. Yet increasing the number of haplotypes, increases the number of variants, which increases the density of SNPs on reads, and improves Gretel’s performance: we cannot compare the 3-taxa and 15-taxa trees of the same diversity in this way as the 15-taxa tree has an unfair advantage.

I debated with my resident PhD tree pervert about this. In particular, I suggested that perhaps the diversity could be equally split between the branches, such that synthetic read sets from a 3-taxa tree and a 15-taxa tree should expect to have the same number of called variants, even if the individual haplotypes themselves have a different level of variation between the trees. Chris argued that whilst that would fix the problem and make the trees more comparable, going against the grain of simple biological explanations would reintroduce the boilerplate explanation bloat to the paper that we were trying to avoid in the first place.

Around this time I decided to say fuck everything, gave up and wrote a shell for a little while.

Deconfounding the megabox

So where are we now? Firstly, I agreed with Chris. I think splitting the diversity between haplotypes, whilst yielding datasets that might be more readily comparable, will just make for more difficult explanations in our paper. But fundamentally, I don’t think these comparisons actually help us to tell the story of Hansel and Gretel. Thinking about it afterwards, there are other nasty, unobserved variables in our megaboxplot experiment that directly affect the density of variants on reads, namely: read length and read coverage. We had fixed these to 150bp and 7x coverage for the purpose of our analysis, which felt like a dirty trick.

At this point, bioinformatics was starting to feel like a grand conspiracy, and I was in on it. Would it even be possible to fairly test and describe how our algorithm works through the noise of all of these confounding factors?

I envisaged the most honest method to describe the efficacy of my approach, as a sort of lookup table. I want our prospective users to be able to determine what sort of haplotype recovery rates might be possible from their metagenome, given a few known attributes, such as read length and coverage, at their region of interest. I also feel obligated to show under what circumstances Gretel performs less well, and offer reasoning for why. But ultimately, I want readers to know that this stuff is really fucking hard.

Introducing the all new low-fat less-garbage megaboxplot

Here is where I am right now. I took this lookup idea, and ran a new experiment consisting of some 1500 sets of reads, and runs of Gretel, and threw the results together to make this:

  • Horizontal facets represent synthetic read length
  • Vertical facets are (again) per-haplotype, per-base mutation rates, this time expressed as a percentage (so a rate of 0.01, is now 1%)
  • Colour coded X-axis of each boxplot depicts the average per-haplotype read coverage
  • Y-axis of each boxplot quantifies the average best recovery rate made by Gretel for all of the five haplotypes, over ten executions of Gretel (each using a different randomly generated, uniformly distributed read set)

I feel this graph is much more tangible to users and readers. I feel much more comfortable expressing our recovery rates in this format, and I hope eventually our reviewers and real users will agree. Immediately we can see this figure reinforces some expectations: primarily, increasing the read length and/or coverage yields a large improvement in Gretel’s performance. Increasing read length also lowers the coverage required for accuracy.

This seems like a reasonable proof of concept, so what’s next?

  • Generate a significant amount more input data, preferably in a way that doesn’t make me feel ill or depressed
  • Battle with the cluster to execute more experiments
  • Generate many more pretty graphs

I’d like to run this test for metahaplomes with a different number of taxa, just to satisfy my curiosity. I also want to investigate the 1 – 2% diversity region in a more fine-grained fashion. Particularly important will be repeating the experiments with multiple metahaplomes for each read length, coverage and sequence diversity parameter triplet, to randomise away the influence of the tree itself. I’m confident this is the reason for inconsistencies in the latest plot, such as the 1.5% diversity tree with 100bp reads yielding no results (likely because this particular tree generates haplotypes whose piled-up reads contain pairs of variants more than 100bp apart).
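
To make the shape of that experiment concrete, here is a rough sketch of the sweep in throwaway Python. The helper functions, parameter values and placeholder "scores" are all invented for illustration; the real harness shells out to a read simulator and to Gretel itself and scores recoveries against the known haplotypes.

```python
import itertools
import random
import statistics

def make_metahaplome(n_taxa, diversity, seed):
    # Placeholder: mutate a random ancestor at the given per-base rate, rather than
    # simulating haplotypes along a proper tree as the real experiment does.
    random.seed(seed)
    ancestor = [random.choice("ACGT") for _ in range(300)]
    return ["".join(random.choice("ACGT") if random.random() < diversity else b
                    for b in ancestor) for _ in range(n_taxa)]

def run_one_replicate(haplotypes, read_length, coverage, seed):
    # Placeholder: simulate a read set, run Gretel, return best recovery per haplotype.
    random.seed(seed)
    return [random.uniform(0.5, 1.0) for _ in haplotypes]  # not real results!

READ_LENGTHS = [75, 100, 150, 250]
COVERAGES = [3, 5, 7, 10]
DIVERSITIES = [0.005, 0.01, 0.015, 0.02]  # per-haplotype, per-base mutation rate
TREES_PER_TRIPLET = 5                     # randomise away the influence of any one tree
READ_SETS_PER_TREE = 10

rows = []
for read_len, cov, div in itertools.product(READ_LENGTHS, COVERAGES, DIVERSITIES):
    for tree_seed in range(TREES_PER_TRIPLET):
        haps = make_metahaplome(n_taxa=5, diversity=div, seed=tree_seed)
        per_tree = []
        for read_seed in range(READ_SETS_PER_TREE):
            rates = run_one_replicate(haps, read_len, cov, seed=read_seed)
            per_tree.append(statistics.mean(rates))  # mean best recovery over the haplotypes
        rows.append((read_len, cov, div, tree_seed, statistics.mean(per_tree)))

print(len(rows), "summarised parameter/tree combinations")
```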


Conclusion

  • Generate more fucking metahaplomes
  • Get this fucking paper out

tl;dr

  • I don’t want to be doing this PhD thing in a year’s time
  • I’ve finally started looking again at our glorious rejected pre-print
  • The trivial haplomes tanked; they were too hard to explain to reviewers and actually don’t provide that much context on Gretel anyway
  • New tree-based datasets have superseded the triviomes2
  • Phylogenetics maybe isn’t so bad (but I’m still not sure)
  • Once again, the cluster and parallelism in general has proven to be the bane of my fucking life
  • It can be quite difficult to present results in a sensible and meaningful fashion
  • There are so many confounding factors in analysis and I feel obligated to control for them all because it feels like bad science otherwise
  • I’m fucking losing it lately
  • Playing spaceships in space is great but don’t expect to not be blown out of fucking orbit just because you are trying to have a nice time
  • I really love ggplot2, even if the rest of R is garbage
  • I’ve been testing Gretel at “silly” levels of variation thinking that this gives proof that we are good at really hard problems, but actually more variation seems to make the problem of recovery easier
  • 1.5% per-haplotype per-base mutation seems to be my current magic number (n=2, because fuck you)
  • I wrote a shell because keeping track of all of this has been an unmitigated clusterfuck
  • I now have some plots that make me feel less like I want to jump off something tall
  • I only seem to enjoy video games that have plenty of micromanagement that stress me out more than my PhD
  • I think Bioinformatics PhD Simulator 2018 would make a great game
  • Unrealistic testing cannot give realistic answers
  • My supervisor, Chris is a massive dendrophile3
  • HR bullshit makes a grumpy PhD student much more grumpy
  • This stuff, is really fucking hard

  1. supplant HAH GET IT 
  2. superseeded HAHAH I AM ON FIRE 
  3. phylogenphile? 
Bioinformatics is a disorganised disaster and I am too. So I made a shell. https://samnicholls.net/2016/11/16/disorganised-disaster/ https://samnicholls.net/2016/11/16/disorganised-disaster/#respond Wed, 16 Nov 2016 17:50:59 +0000 https://samnicholls.net/?p=1581 If you don’t want to hear me wax lyrical about how disorganised I am, you can skip ahead to where I tell you about how great the pseudo-shell that I made and named chitin is.

Back in 2014, about half way through my undergraduate dissertation (Application of Machine Learning Techniques to Next Generation Sequencing Quality Control), I made an unsettling discovery.

I am disorganised.

The discovery was made after my supervisor asked a few interesting questions regarding some of my earlier discarded analyses. When I returned to the data to try and answer those questions, I found I simply could not regenerate the results. Despite the fact that both the code and each “experiment” were tracked by a git repository and I’d written my programs to output (what I thought to be) reasonable logs, I still could not reproduce my science. It could have been anything: an ad-hoc, temporary tweak to a harness script, a bug fix in the code itself masking a result, or any number of other possible untracked changes to the inputs or program parameters. In general, it was clear that I had failed to collect all pertinent metadata for an experiment.

Whilst it perhaps sounds like I was guilty of negligent book-keeping, it really wasn’t for lack of trying. Yet when dealing with many interesting questions at once, it’s so easy to make ad-hoc changes, or perform undocumented command line based munging of input data, or accidentally run a new experiment that clobbers something. Occasionally, one just forgets to make a note of something, or assumes a change is temporary but for one reason or another, the change becomes permanent without explanation. These subtle pipeline alterations are easily made all the time, and can silently invalidate swathes of results generated before (and/or after) them.

Ultimately, for the purpose of reproducibility, almost everything (copies of inputs, outputs, logs, configurations) was dumped and tar‘d for each experiment. But this approach brought problems of its own: just tabulating results was difficult in its own right. In the end, I was pleased with that dissertation, but a small part of me still hurts when I think back to the problem of archiving and analysing those result sets.

It was a nightmare, and I promised it would never happen again.

Except it has.

A relapse of disorganisation

Two years later and I’ve continued to be capable of convincing a committee to allow me to progress towards adding the title of doctor to my bank account. As part of this quest, recently I was inspecting the results of a harness script responsible for generating trivial haplotypes, corresponding reads and attempting to recover them using Gretel. “Very interesting, but what will happen if I change the simulated read size”, I pondered; shortly before making an ad-hoc change to the harness script and inadvertently destroying the integrity of the results I had just finished inspecting by clobbering the input alignment file used as a parameter to Gretel.

Argh, not again.

Why is this hard?

Consider Gretel: she’s not just a simple standalone tool that one can execute to rescue haplotypes from the metagenome. One must first go through the motions of pushing their raw reads through some form of pipeline (pictured below) to generate an alignment (essentially giving a co-ordinate system to those reads) and to discover the variants (the positions in that co-ordinate system that relate to polymorphisms on reads), which together form the required inputs for the recovery algorithm.

This is problematic for one who wishes to be aware of the provenance of all outputs of Gretel, as those outputs depend not only on the immediate inputs (the alignment and called variants), but on the entirety of the pipeline that produced them. Thus we must capture as much information as possible regarding all of the steps that occur from the moment the raw reads hit the disk, up to Gretel finishing with extracted haplotypes.

But as I described in my last status report, these tools are themselves non-trivial. bowtie2 has more switches than an average spaceship, and its output depends on its complex set of parameters and inputs (that also have dependencies on previous commands), too.

[Image: the pipeline of steps between raw reads and Gretel's inputs]

bash scripts are all well and good for keeping track of a series of commands that yield the result of an experiment, and one can create a nice new directory in which to place such a result at the end – along with any log files and a copy of the harness script itself for good measure. But what happens when future experiments use different pipeline components, with different parameters, or we alter the generation of log files to make way for other metadata? What’s a good directory naming strategy for archiving results anyway? What if parts (or even all) of the analysis are ad-hoc and we are left to reconstruct the history? How many times have you made a manual edit to a malformed file, or had to look up exactly what combination of sed, awk and grep munging you did that one time?

One would have expected me to have learned my lesson by now, but I think meticulous digital lab book-keeping is just not that easy.

What does organisation even mean anyway?

I think the problem is perhaps exacerbated by conflating the meaning of “organisation”. There are a few somewhat different, but ultimately overlapping problems here:

  • How to keep track of how files are created
    What command created file foo? What were the parameters? When was it executed, by whom?
  • Be aware of the role that each file plays in your pipeline
    What commands go on to use file foo? Is it still needed?
  • Assure the ongoing integrity of past and future results
    Does this alignment have reads? Is that FASTA index up to date?
    Are we about to clobber shared inputs (large BAMS, references) that results depend on?
  • Archiving results in a sensible fashion for future recall and comparison
    How can we make it easy to find and analyse results in future?

Indeed, my previous attempts at organisation address some but not all of these points, which is likely the source of my bad feeling. Keeping hold of bash scripts can help me determine how files are created, and the role those files go on to play in the pipeline; but results are merely dumped in a directory. Such directories are created with good intent, and named something that was likely useful and meaningful at the time. Unfortunately, I find that these directories become less and less useful as archive labels as time goes on… For example, what the fuck is ../5-virus-mix/2016-10-11__ref896__reg2084-5083__sd100/1?

This approach also had no way to assure the current and future integrity of my results. Last month I had an issue with Gretel outputting bizarrely formatted haplotype FASTAs. After chasing my tail trying to find a bug in my FASTA I/O handling, I discovered this was actually caused by an out of date FASTA index (.fai) on the master reference. At some point I’d exchanged one FASTA for another, assuming that the index would be regenerated automatically. It wasn’t. Thus the integrity of experiments using that combination of FASTA+index was damaged. Additionally, the integrity of the results generated using the old FASTA were now also damaged: I’d clobbered the old master input.

There is a clear need to keep better metadata for files, executed commands and results, beyond just tracking everything with git. We need a better way to document the changes a command makes in the file system, and a mechanism to better assure integrity. Finally we need a method to archive experimental results in a more friendly way than a time-sensitive graveyard of timestamps, acronyms and abbreviations.

So I’ve taken it upon myself to get distracted from my PhD to embark on a new adventure to save myself from ruining my PhD2, and fix bioinformatics for everyone.

Approaches for automated command collection

Taking the number of post-its attached to my computer and my sporadically used notebooks as evidence enough to outright skip over the suggestion of a paper based solution to these problems, I see two schools of thought for capturing commands and metadata computationally:

  • Intrusive, but data is structured with perfect recall
    A method whereby users must execute commands via some sort of wrapper. All commands must have some form of template that describes inputs, parameters and outputs. The wrapper then “fills in” the options and dispatches the command on the user’s behalf. All captured metadata has uniform structure and nicely avoids the need to attempt to parse user input. Command reconstruction is perfect but usage is arguably clunky.
  • Unobtrusive, best-effort data collection
    A daemon-like tool that attempts to collect executed commands from the user’s shell and monitor directories for file activity. Parsing command parameters and inputs is done in a naive best-effort scenario. The context of parsed commands and parameters is unknown; we don’t know what a particular command does, and cannot immediately discern between inputs, outputs, flags and arguments. But, despite the lack of structured data, the user does not notice our presence.

There is a trade-off between usability and data quality here. If we sit between a user and all of their commands, offering a uniform interface to execute any piece of software, we can obtain perfectly structured information and are explicitly aware of parameter selections and the paths of all inputs and desired outputs. We know exactly where to monitor for file system changes, and can offer user interfaces that not only merely enumerate command executions, but offer searching and filtering capabilities based on captured parameters: “Show me assemblies that used a k-mer size of 31”.

But we must ask ourselves, how much is that fine-grained data worth to us? Is exchanging our ability to execute commands ourselves worth the perfectly structured data we can get via the wrapper? How many of those parameters are actually useful? Will I ever need to find all my bowtie2 alignments that used 16 threads? There are other concerns here too: templates that define a job specification must be maintained. Someone must be responsible for adding new (or removing old) parameters to these templates when tools are updated. What if somebody happens to misconfigure such a template? More advanced users may be frustrated at being unable to merely execute their job on the command line. Less advanced users could be upset that they can’t just copy and paste commands from the manual or biostars. What about smaller jobs? Must one really define a command template to run trivial tools like awk, sed, tail, or samtools sort through the wrapper?

It turns out I know the answer to this already: the trade-off is not worth it.

Intrusive wrappers don’t work: a sidenote on sunblock

Without wanting to bloat this post unnecessarily, I want to briefly discuss a tool I’ve written previously, but first I must set the scene3.

Within weeks of starting my PhD, I made a computational enemy in the form of Sun Grid Engine: the scheduler software responsible for queuing, dispatching, executing and reporting on jobs submitted to the institute’s cluster. I rapidly became frustrated with having an unorganised collection of job scripts, with ad-hoc edits that meant I could no longer re-run a job previously executed with the same submission script (does this problem sound familiar?). In particular, I was upset with the state of the tools provided by SGE for reporting on the status of jobs.

To cheer myself up, I authored a tool called sunblock, with the goal of never having to look at any component of Sun Grid Engine directly ever again. I was successful in my endeavour and to this day continue to use the tool on the occasion where I need to use the cluster.

[Screenshot: sunblock]

However, as hypothesised above, sunblock does indeed require an explicit description of an interface for any job that one would wish to submit to the cluster, and it does prevent users from just pasting commands into their terminal. This all-encompassing wrapping feature, which allows us to capture the best structured information on every job, is also the tool’s complete downfall. Despite the useful information that could be extracted using sunblock (there is even a shiny sunblock web interface), its ability to automatically re-run jobs, and its superior reporting on job progress compared to SGE alone, it still failed to gain user traction in our institute.

For the same reason that I think more in-the-know bioinformaticians don’t want to use Galaxy, sunblock failed: because it gets in the way.

Introducing chitin: an awful shell for awful bioinformaticians

Taking what I learned from my experimentation with sunblock on board, I elected to take the less intrusive, best-effort route to collecting user commands and file system changes. Thus I introduce chitin: a Python-based tool that (somewhat) unobtrusively wraps your system shell, keeping track of commands and file manipulations to address the problem of not knowing how any of the files in your ridiculously complicated bioinformatics pipeline came to be.

I initially began the project with a view to creating a digital lab book manager. I envisaged offering a command line tool with several subcommands, one of which could take a command for execution. However, as soon as I tried out my prototype and found myself prepending the majority of my commands with lab execute, I wondered whether I could do better. What if I just wrapped the system shell and captured all entered commands? This might seem a rather dumb and roundabout way of getting one’s command history, but consider this: if we wrap the system shell as a means to capture all the input, we are also in a position to capture the output for clever things, too. Imagine a shell that could parse the stdout for useful metadata to tag files with…
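
To illustrate the idea (and only the idea; this is not chitin's actual code, and the prompt, variable names and record format below are made up), a bare-bones version of such a shell fits in a few lines of Python: read a command, snapshot the working directory, run the command, snapshot again, and record what changed along with the captured output.

```python
import os
import subprocess
import time

def snapshot(root="."):
    """Map every file under root to its last modification time."""
    state = {}
    for dirpath, _, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                state[path] = os.path.getmtime(path)
            except OSError:
                pass
    return state

history = []  # (command, runtime, touched paths, captured stdout)

while True:
    try:
        command = input("notchitin> ").strip()
    except EOFError:
        break
    if not command:
        continue
    if command in ("exit", "quit"):
        break

    before = snapshot()
    start = time.time()
    result = subprocess.run(command, shell=True, capture_output=True, text=True)
    runtime = time.time() - start

    touched = [p for p, mtime in snapshot().items() if before.get(p) != mtime]
    history.append((command, runtime, touched, result.stdout))
    print(result.stdout, end="")
    print(result.stderr, end="")
    print("# took %.2fs, touched %d file(s)" % (runtime, len(touched)))
```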

I liked what I was imagining, and so despite my best efforts to get even just one person to convince me otherwise; I wrote my own pseudo-shell.

chitin is already able to track executed commands that yield changes to the file system. For each file in the chitin tree, there is a full modification history. Better yet, you can ask what series of commands need to be executed in order to recreate a particular file in your workflow. It’s also possible to tag files with potentially useful metadata, and so chitin takes advantage of this by adding the runtime4 and current user to all executed commands for you.

Additionally, I’ve tried to find my own middle ground between the sunblock-esque configurations that yielded superior metadata, and not getting in the way of our users too much. So one may optionally specify handlers that can be applied to detected commands, and captured stdout/stderr. For example, thanks to my bowtie2 configuration, chitin tags my out.sam files with the overall alignment rate (and a few targeted parameters of interest), automatically.
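
As a rough sketch of what such a handler might look like (the registry and function names here are invented, not chitin's actual configuration format), you can pull the alignment rate straight out of bowtie2's summary, which it writes to stderr ending with a line something like "97.53% overall alignment rate":

```python
import re

def bowtie2_handler(stdout, stderr):
    """Extract the overall alignment rate from a bowtie2 run's captured stderr."""
    match = re.search(r"([\d.]+)% overall alignment rate", stderr)
    return {"overall_alignment_rate": float(match.group(1))} if match else {}

# A toy handler registry, keyed on the program name at the start of the command
HANDLERS = {"bowtie2": bowtie2_handler}

def metadata_for(command, stdout, stderr):
    handler = HANDLERS.get(command.split()[0])
    return handler(stdout, stderr) if handler else {}

print(metadata_for("bowtie2 -x ref -U reads.fq -S out.sam", "",
                   "100 reads; of these:\n...\n97.53% overall alignment rate\n"))
```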

[Screenshot: chitin automatically tagging a bowtie2 output with its overall alignment rate]

chitin also allows you to specify handlers for particular file formats to be applied to files as they are encountered. My environment, for example, is set-up to count the number of reads inside a BAM, and associate that metadata with that version of the file:

[Screenshot: chitin associating a read count with a specific version of a BAM]

In this vein, we are in a nice position to check on the status of files before and after a command is executed. To address some of my integrity woes, chitin allows you to define integrity handlers for particular file formats too. Thus my environment warns me if a BAM has 0 reads, is missing an index, or has an index older than itself. Similarly, an empty VCF raises a warning, as does an out of date FASTA index. Coming shortly will be additional checks for whether you are about to clobber a file that is depended on by other files in your workflow. Kinda cool, even if I do say so myself.
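
The checks themselves are nothing magic. A stripped-down sketch of the sort of thing my environment does (this is not chitin's real handler API, and counting the reads in a BAM properly would need samtools or pysam rather than the crude stat calls here) looks like:

```python
import os

def bam_warnings(path):
    warnings = []
    index = path + ".bai"
    if not os.path.exists(index):
        warnings.append("BAM has no index")
    elif os.path.getmtime(index) < os.path.getmtime(path):
        warnings.append("BAM index is older than the BAM itself")
    # A real check would count the reads (e.g. with samtools); file size is a crude proxy
    if os.path.getsize(path) == 0:
        warnings.append("BAM appears to be empty")
    return warnings

def fasta_warnings(path):
    warnings = []
    index = path + ".fai"
    if os.path.exists(index) and os.path.getmtime(index) < os.path.getmtime(path):
        warnings.append("FASTA index is out of date")
    return warnings
```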

Conclusion

Perhaps I’m trying to solve a problem of my own creation. Yet from a few conversations I’ve had with folks in my lab, and frankly, anyone I could get to listen to me for five minutes about managing bioinformatics pipelines, there seems to be sympathy for my cause. I’m not entirely convinced myself that a “shell” is the correct solution here, but it does seem to place us in the best position to get commands entered by the user, with the added bonus of getting stdout to parse for free. Though, judging by the flurry of Twitter activity on my dramatically posted chitin screenshots lately, I suspect I am not so alone in my disorganisation and there are at least a handful of bioinformaticians out there who think a shell isn’t the most terrible solution to this either. Perhaps I just need to be more of a wet-lab biologist.

Either way, I genuinely think there’s a lot of room to do cool stuff here, and to my surprise, I’m genuinely finding chitin quite useful already. If you’d like to try it out, the source for chitin is open and free on GitHub. Please don’t expect too much in the way of stability, though.


tl;dr

  • A definition of “being organised” for science and experimentation is hard to pin down
  • But independent of such a definition, I am terminally disorganised
  • Seriously what the fuck is ../5-virus-mix/2016-10-11__ref896__reg2084-5083__sd1001
  • I think command wrappers and platforms like Galaxy get in the way of things too much
  • I wrote a “shell” to try and compensate for this
  • Now I have a shell, it is called chitin

  1. This is a genuine directory in my file system, created about a month ago. It contains results for a run of Gretel against the pol gene on the HIV genome (2084-5083). Off the top of my head, I cannot recall what sd100 is, or why reg appears before the base positions. I honestly tried. 
  2. Because more things that are not my actual PhD is just what my PhD needs. 
  3. If it helps you, imagine some soft jazz playing to the sound of rain while I talk about this gruffly in the dark with a cigarette poking out of my mouth. Oh, and everything is in black and white. It’s bioinformatique noir
  4. I’m quite pleased with this one, because I pretty much always forget to time how long my assemblies and alignments take. 
Interdisciplinary talks and the metaphor-ome: Harder than metagenomics itself? https://samnicholls.net/2016/11/03/talks-and-metaphors/ https://samnicholls.net/2016/11/03/talks-and-metaphors/#respond Thu, 03 Nov 2016 21:18:12 +0000 https://samnicholls.net/?p=1390 Yesterday I spoke at the Centre of Computational Biology at Birmingham University. I was invited to give a talk as part of their research seminar series about the work I have been doing on metagenomes. The lead up to this has been pretty nerve-wracking as this was my first talk outside of Aberystwyth (since my short introductory talk at KU Leuven last year), and the majority of my previous talks have been to my peers, which I find to be a lot less intimidating than a room full of experts of various fields.

Metaphorical Metagenomes

I submitted the current working title of my PhD: “Extracting exciting exploitable enzymes from massive metagenomes“, which I think is a rather catchy summary of what I’m working on here. I borrowed the opening slides from my previous talks (this is a cow…) but felt like I needed to try a new explanation of why the metagenome is so difficult to work with. Previously, I’ve described the problem with jigsaw puzzles: i.e. consider many distinct (but visually similar) jigsaws, mixed together (with duplicate and missing pieces). Whilst this is a nice, accessible description that appears to serve well, it tends to leave some listeners confused about my objective, particularly:

  • You are recovering whole genomes?
    The jigsaw metaphor doesn’t lend well to the metahaplome and the concept of assembling a jigsaw partially. Listeners assume we want to piece together each of the different jigsaws in our box, whole – presumably because people find those who don’t finish jigsaws terrible.
  • We can assemble jigsaws incorrectly?
    Metagenomic assemblies are a consensus sequence of the observed reads. The resulting sequence is unlikely to exist in nature. Whilst we can extend our metaphor to explain that pieces of jigsaws may have the same shape, such that we can put together puzzles that don’t exist, this is not immediately obvious to listeners.

A common analogy for genomic assembly is that of pages shredded from a book. I’ve also previously pitched this at a talk to try and explain metagenomic assembly, but this has some disastrous metaphorical pitfalls too:

  • You are recovering whole books?
    Akin to the jigsaw analogy, listeners don’t immediately see why we would only want to assemble parts of a book. What part? A chapter? A page? A paragraph? Which leads to…
  • Why are there paragraphs shared between books?
    To describe our problem of recovering genes that appear across multiple species, we must say that we are attempting to recover some shared sequence of words from across many books. This somewhat breaks the metaphor as this isn’t a problem that exists, and so the concept just causes listener confusion, rather than helping them to understand our problem. Whilst we could point out the Bible as an example of a book that has been translated and shared to a point where two copies of the text do feature differences between their passages, we figure it best to avoid conversations about the Bible and shredded books.
  • You are assembling words into sentences? The problem is easy?
    DNA has a limited alphabet: A, C, G and T. But books can contain a practically infinite combination of character sequences given their larger alphabets. This larger alphabet makes distinguishing sequence similarity incredibly simple compared to that of DNA. Right now I’m using an alphabet of about 95 characters (upper and lowercase characters, numbers and a subset of symbols) in this post, and although it’s possible that one or more of my sentences could appear elsewhere on the web (unintentionally), the probability of this will be many, many orders of magnitude smaller than that of finding duplication of DNA sequences within and between genomes. Thus by comparing the problem to reconstructing pages from a book, we are at a very real risk of underselling the difficulty of the problem at hand.

Additionally, both analogies fail to explain:

  • Intra-object variation
    We must also shoehorn the concept of intraspecies gene variation into these metaphors which turns out rather clunky. We do say that books and jigsaws have misprints and errors, but this doesn’t truly emphasise that there is real variation between instances of the same object.
  • What is the biological significance anyway?
    Neither description of the problem comes close to explaining why we’d even want to retrieve the different but similar-looking areas of a jigsaw, or copies of a page or passage shared across multiple books.

Machines and Factories: A new metaphor?

So, I spent some time iterating over many ideas and came up with a new concept of “genes as machines” and “genomes as factories”:

Genes

Consider a gene as a physical machine. Its configuration is set by altering the state of its switches. The configuration of a machine is akin to a sequence of DNA. It is possible (and even intended) that the machine can be configured in many different ways by changing the state of its switches (like gene variants), but it is still the same machine (the same gene). This is an important concept because we want to describe that a machine can have many configurations (that can do nothing, improve performance, or even break it), whilst still remaining the same machine (i.e. a variant of a gene).

[Diagram: a gene as a machine whose switches set its configuration]

Factories

We can consider a genome as a factory, holding a collection of machines and their configurations:

[Diagram: a genome as a factory housing a collection of configured machines]

We can extend this metaphor to groups of factories under a parent organisation (i.e. a species) who can set the configuration of their machines autonomously – introducing intra-species variation as a concept. Additionally we can describe groups of factories under other parent organisations (species) that also deploy the same machine to their own factories, also configuring them differently – introducing not only intra-species variation, but multiple sets of intra-species variants too:

[Diagram: groups of factories under different parent organisations, each configuring the same machine differently]

Talk the Talk

Armed with my shiny diagrams and apprehension of my own new metaphor, I pitched it to my listeners as a test and thanked them for their role as guinea pigs to my new attempt at explaining the core concept of the metagenome and its problems.

In general, I felt like the audience followed along with the metaphor to begin with. Given a fictional company (Enzytech) and their machine (Awesomease), we could define the metahaplome as the collection of configurations of that Enzytech Awesomease product across multiple factories, under various parent companies (i.e. different genomes, of varying species). However I think the story unravelled when I described the process of actually recovering the metahaplome.

I set the scene: Sam from Enzytech wondered why factories configured their Awesomease differently. Sam figured there must be an interesting meaning to these configurations – do some combinations of switches cause the Awesomease to -ase more awesomely? Thus, Sam approaches each parent company and requests their Enzytech Awesomease configurations. In a surprising gesture of co-operation, the businesses comply and return all their Enzytech Awesomease configurations, for all of their factories. Unfortunately, and perhaps in breach of their own trade secrets, they also submit the configurations of every other machine (gene) in each of their factories (genomes) too:

[Diagram: every company returns the configurations of every machine in every one of its factories]

To make matters worse, the configurations don’t describe the specific factory they are from (i.e. the individual organism), and their returned documents also include incomplete, broken and duplicated configurations. Lost configurations are not submitted.

I think at this point, I was getting too wrapped up in the metaphor and its story. The concept of metaphorical factories submitting bad paperwork to fictional Sam from Enzytech did not have an obvious biological reference point (it was supposed to describe metagenomic sampling). I think with practice, I could deliver this part better such that my audience understands the relevance to biology, but I am not sure it is necessary. Where things definitely did not work was this slide:

[Slide: the clumsy Enzytech intern]

“Unfortunately, an Enzytech intern misfiled the paperwork submitted by all of the parent companies’ factories (species and their various genomes), and we could no longer determine which company submitted which configuration. The same clumsy intern then accidentally shredded all of the configurations, too.”

Welp. I am somewhat cringing at the amount of biological shoehorning going on in just one slide. Here I wanted to explain that although my pretty pictures have helpful colour coding for each of the companies (species), we don’t have this in our data. A configuration (gene variant) could come from any factory (genome) in our sample, and there is no way of differentiating them. Although shredding is a (common) reference to short-read sequencing technology, the delivery of this slide feels as clumsy as the Enzytech intern. I think the mistake I have made here was trying to use the same metaphorical explanation for two separate and distinct problems that I face in my work on metagenomes:

  • The metahaplome
    We need to clearly define what the metahaplome actually is as it is a term we coined. It is also the objective of my algorithm, and so failing to adequately describe this means it is unclear why this work has a biological relevance (or is worth a PhD).
  • Metagenomes, assembly, and short read sequencing
    This final slide attempts to describe metagenomes and sequencing, as shredded paperwork relating to many different genes, from multiple factories that are held by various parent companies, all mixed together. But in fact, for this part of the metaphor it is easier to just say “bits of DNA, from a gene, on multiple organisms, from multiple species in the same environmental sample”…

On this occasion, I believe I managed to explain the metahaplome more clearly to an audience than ever before, though this might be in part because this is my first talk since our pre-print. However, in forcing my new metaphor onto the latter problem (of sequencing), I think I inadvertently convoluted what the metagenome is. So ultimately, I’m not entirely convinced the new metaphor panned out with a mixed audience of expert computer scientists and biologists. That said, I had several excellent questions following the talk, that seemed to imply a deep understanding of the work I presented, so hooray! Regardless of whether I deploy it for my next talk, I think it will still prove a nice way to explain my work to the public at large (who may have no frame of reference to get confused with).

I enjoyed the opportunity to visit somewhere new and speak about my work, especially as my first invited talk outside of Aberystwyth. This is also a reminder that even sharing thoughts and progress on cross-discipline work is hard. It’s a lot of work to come up with a way to get everyone in the audience on the same page, speaking the same language (biological, mathematical and computational terminology), and to give them the necessary background knowledge (genomic sequencing and metagenomes) to even begin to pitch the novelty and importance of our own work.


Obligatory proof that people attended:

Obligatory omg my heart rate:

[Screenshot: my heart rate during the talk]


tl;dr

  • I was invited to speak at Birmingham, it was nice
  • It’s super hard to come up with explanations of your work that will please everyone
  • Spending until 4am drawing some rather shiny diagrams is perhaps not the best reason to push forth with a new metaphor that even you feel a little uneasy about
  • I continue to speak too bloody quickly
  • My body still gives the physiological impression I am doing exercise whilst speaking publicly
Status Report: Jul Aug September 2016 https://samnicholls.net/2016/10/24/status-sep16/ https://samnicholls.net/2016/10/24/status-sep16/#respond Mon, 24 Oct 2016 15:37:02 +0000 https://samnicholls.net/?p=1201 A sufficient time had passed since my previous report such that there were both a number of things to report, and I crossed the required threshold of edginess to provide an update on the progress of the long list of things I am supposed to be doing in my quest to become Dr. Nicholls. I began a draft of this post shortly before I attended the European Conference on Computational Biology at the start of September. However at the end of the conference, I spontaneously decided to temporarily abort my responsibilities to bioinformatics, not return to Aberystwyth, electing to spend a few weeks traversing Eastern Europe instead. I imagine there will be more on this in a future post, but for now let’s focus on the long overdue insight into the work that I am supposed to be doing.

In our last installment, I finally managed to describe in words how the metahaplome is somewhat like a graph, but possesses properties that also make it not like a graph. The main outstanding issues were related to reweighting evidence in the metahaplome structure, the ongoing saga of generating sufficient data sets for evaluation and writing this all up so people believe us when we say it is possible. Of the many elephants in the room, the largest was my work still being described as the metahaplome or some flavour of Sam’s algorithm. It was time to try and conquer one of the more difficult problems in computer science: naming things.

Introducing Hansel and Gretel

Harbouring a dislike for the apparent convention in bioinformatics that software names should be unimaginative1 or awkwardly constructed acronyms, and spotting an opportunity to acquire a whimsical theme, I decided to follow in the footsteps of Goldilocks, and continue the fairy tale naming scheme; on the condition I could find a name that fit comfortably.

Unsurprisingly, this proved quite difficult. After some debate, the most suitable fairy tale name I could find was Rumpelstiltskin, for its ‘straw-to-gold’ reference. I liked the idea of your straw-like reads being converted to golden haplotypes. However as I am not remotely capable of spelling the name without Google, and the link between the name and software feels a tad tenuous, I vetoed the option and forged forward with the paper, with the algorithm Untitled.

As I considered the logistics of packaging the implementation as it stood, I realised that I had essentially created both a novel data structure for storing the metahaplome, as well as an actual algorithm for utilising that information to recover haplotypes from a metagenome. At this point everything fell into place; a nice packaging solution and a fitting pair of names had resolved themselves. Thus, I introduce the Hansel data structure, and the Gretel algorithm; a framework for recovering real haplotypes from metagenomes.

Hansel

Hansel is a Python package that houses our novel data structure. I describe it as a “graph-inspired data structure for determining likely chains of symbol sequences from crummy evidence”. Hansel is designed to work with counts of pairwise co-occurrences of symbols in space or time. For our purposes, those symbols are the chemical bases of DNA (or RNA, or perhaps even amino acids of proteins), and the “space” dimension is their position along some sequence.

[Figure: three corresponding representations: (top) aligned reads, (middle) the Hansel structure, (bottom) a graph that can be derived from the Hansel structure]

We fill this structure with counts of the number of times we observe some pair of nucleotides at some pair of positions on the same read. Hansel provides a numpy ndarray-backed class that offers a user-friendly API for operating on our data structure, including adding, adjusting and fetching the counts of pairwise co-occurrences of pairs of symbols in space and time, and making probabilistic queries on the likelihood of symbols occurring, given some sequence of observed symbols thus far.
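
To make that a little more concrete, here is a toy version of the idea. The class name, method names and symmetric storage are mine, for illustration only; the real Hansel API is richer and also handles unknown symbols, sentinels, smoothing and reweighting.

```python
import numpy as np

class MiniHansel:
    """Toy pairwise co-occurrence store: counts[a, b, i, j] is the number of times
    symbol a was seen at SNP position i on the same read as symbol b at position j."""

    SYMBOLS = {"A": 0, "C": 1, "G": 2, "T": 3}

    def __init__(self, n_positions):
        n = len(self.SYMBOLS)
        self.counts = np.zeros((n, n, n_positions, n_positions))

    def add_observation(self, a, i, b, j, weight=1.0):
        # Store both orderings so queries can be made in either direction
        self.counts[self.SYMBOLS[a], self.SYMBOLS[b], i, j] += weight
        self.counts[self.SYMBOLS[b], self.SYMBOLS[a], j, i] += weight

    def get_counts(self, a, i, b, j):
        return self.counts[self.SYMBOLS[a], self.SYMBOLS[b], i, j]

    def prob_given(self, a, i, b, j):
        """P(symbol a at i | symbol b at j), with no smoothing."""
        spanning = self.counts[:, self.SYMBOLS[b], i, j].sum()
        return self.get_counts(a, i, b, j) / spanning if spanning else 0.0

# Fill from a single read covering SNP sites 0, 1 and 2 with alleles A, C, T
h = MiniHansel(n_positions=3)
read = [(0, "A"), (1, "C"), (2, "T")]
for (i, a), (j, b) in ((x, y) for x in read for y in read if x[0] < y[0]):
    h.add_observation(a, i, b, j)
print(h.prob_given("C", 1, "A", 0))  # 1.0: every read with A at site 0 had C at site 1
```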

Gretel

Gretel is a Python package that now houses the actual algorithm that exploits the spun-out API offered by Hansel to attempt to recover haplotypes from the metagenome. Gretel provides a command line based tool that accepts a BAM of aligned reads (against a metagenomic pseudo-reference, typically the assembly) and a VCF of single nucleotide polymorphisms.

Gretel populates the Hansel matrix with observations by parsing the reads in the BAM at the genomic positions described in the provided VCF. Once parsing the reads is complete, Gretel exploits the ability to traverse the Hansel structure like a graph, creating chains of nucleotides that represent the most likely sequence of SNPs that appear on reads that in turn align to some region of interest on the pseudo-reference.

At each node, the decision to move to the next node (i.e. nucleotide) is made by ranking the probability of each candidate node (the node at the end of each current outgoing edge) appearing after some subset of the previously seen nodes (i.e. the current path). Thus both the availability and the associated probabilities of future edges depend not only on the current or previous node, but on the path itself.

[Figure: pairwise conditionals between the L last variants on the observed path and each of the possible next variants are calculated, and the best option (highest likelihood) is chosen]

Gretel will construct a path in this way until a dummy sentinel representing the end of the graph is reached. After a path has been constructed from the dummy (or “sentinel”) start node to the end, observations in the Hansel structure are reweighted, to ensure that the same path is not returned again (as traversal is deterministic) and to allow Gretel to return the next most likely haplotype instead.

Gretel repeatedly attempts to traverse the graph, each time returning the next most likely haplotype given the reweighted Hansel structure, until a node with no outgoing edges is encountered (i.e. all observations between two genomic positions have been culled by re-weighting).
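
Stripped of the probabilistic details (which live in the preprint), the selection step is easy to sketch. Everything below is illustrative: the function names are made up, the scores are combined as a naive product of pairwise conditionals over the last few variants, and the real Gretel does rather more than this.

```python
import math

def next_symbol(pairwise_prob, path, i, candidates, lookback):
    """Pick the most likely symbol at SNP site i given the tail of the current path.
    `pairwise_prob(a, i, b, j)` is any callable returning P(a at i | b at j)."""
    best, best_score = None, -math.inf
    recent = list(enumerate(path))[-lookback:]  # [(site, chosen symbol), ...]
    for a in candidates:
        score = sum(math.log(pairwise_prob(a, i, b, j) or 1e-300) for j, b in recent)
        if score > best_score:
            best, best_score = a, score
    return best

def traverse(pairwise_prob, candidates_per_site, n_sites, lookback=3):
    path = []
    for i in range(n_sites):
        path.append(next_symbol(pairwise_prob, path, i, candidates_per_site[i], lookback))
    return path

# Toy demo: at site 0 we may pick A or G; A strongly implies C at site 1, G implies T
toy = {("C", 1, "A", 0): 0.9, ("T", 1, "A", 0): 0.1,
       ("C", 1, "G", 0): 0.1, ("T", 1, "G", 0): 0.9}
prob = lambda a, i, b, j: toy.get((a, i, b, j), 0.5)
print(traverse(prob, {0: "AG", 1: "CT"}, n_sites=2))  # ['A', 'C']
```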

Going Public

After surgically separating Hansel from Gretel, the codebases were finally in a state where I wasn’t sick at the thought of anyone looking at them. Thus Hansel and Gretel now have homes on Github: see github.com/samstudio8/hansel and github.com/samstudio8/gretel. In an attempt to not be an awful person, partial documentation now also exists on ReadTheDocs: see hansel.readthedocs.io and gretel.readthedocs.io. Hansel and Gretel are both open source and distributed under the MIT Licence.

Reweighting

For the time being, I am settled on the methodology for reweighting the Hansel matrix as described in February’s status report, with the only recent difference being that it is now implemented correctly. Given a path through the metahaplome found by Gretel (that is, a predicted haplotype), we consider the marginal distribution of each of the selected variants that compose that path, and select the smallest (i.e. the least likely variant to be selected across all SNPs). That probability is then used as a ratio to reduce the observations in the Hansel data structure for all pairs of variants on the path.
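
In throwaway Python (a paraphrase of the idea, not the implementation inside Hansel; the helper names and data shapes are invented), the reweighting amounts to:

```python
def reweight_path(counts, path, marginal):
    """Dampen the evidence for a just-returned path.

    `path` is a list of (site, symbol) pairs, `marginal(symbol, site)` is the marginal
    probability of that symbol having been selected at that site, and `counts` maps
    (a, i, b, j) to the pairwise observation count."""
    # The least likely variant on the path sets the reduction ratio...
    ratio = min(marginal(symbol, site) for site, symbol in path)
    # ...and every pair of variants on the path loses that fraction of its observations
    for site_a, sym_a in path:
        for site_b, sym_b in path:
            if site_a < site_b:
                key = (sym_a, site_a, sym_b, site_b)
                if key in counts:
                    counts[key] -= ratio * counts[key]
    return ratio
```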

As I have mentioned in the past, reweighting requires care: an overly aggressive methodology would dismiss similar looking paths (haplotypes) that have shared variants, but an overly cautious algorithm would limit exploration of the metahaplome structure and inhibit recovery of real haplotypes.

Testing so far has indicated that out of the methods I have proposed, this is the most robust, striking a happy balance between exploration-potential and accuracy. At this time I’m happy to leave it as is and work on something else, before I break it horribly.

The “Lookback” Parameter

As the calculation of edge likelihoods depends on a subset of the current path, a keen reader might wonder how such a subset is defined. Your typical bioinformatics software author would likely leave this as an exercise:
-l, --lookback [mysteriously important non-optional integer that fundamentally changes how well this software works, correct selection is a dark art, never set to 0, good luck.]

Luckily for both bioinformatics and my viva, I refuse to inflict the field’s next k-mer size, and there exists a reasonable intuition for the selection of L: the number of nodes over which to lookback (from and including the head of the current path) when considering the next node (nucleotide of the sequence). My intuition is that sequenced read fragments cover some limited number of SNP sites. Thus there will be some L after which it is typically unlikely that pairs of SNPs will co-occur on the same read between the next variant node i+1 and the already observed node (i+1) - L.

I say typically here because in actuality, SNPs will not be uniformly distributed2 and as such the value of L according to this intuition is going to vary across the region of interest depending on both the read sizes, mate pair insert sizes, coverage and indeed the distribution of the SNPs themselves. We thus consider the average number of SNPs covered by all reads parsed from the BAM, and use this to set the value of L.
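
A sketch of that intuition in Python (the function name and inputs are mine; Gretel works this out for itself while parsing the BAM):

```python
def choose_lookback(read_spans, snp_positions):
    """Average number of called SNP sites covered per read, rounded, and at least 1.
    `read_spans` are (start, end) alignment intervals; `snp_positions` are the called variants."""
    per_read = [sum(start <= pos < end for pos in snp_positions)
                for start, end in read_spans]
    informative = [c for c in per_read if c > 0]
    return max(1, round(sum(informative) / len(informative))) if informative else 1

# Three reads and five SNPs: they cover 3, 3 and 2 sites respectively, so L comes out as 3
print(choose_lookback([(0, 150), (100, 250), (200, 350)], [10, 120, 140, 230, 300]))
```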

Smoothing

Smoothing is a component of Hansel that has flown under the radar in my previous discussions. This is likely because I have not given it a second thought since being Belgian last year. Smoothing attempts to awkwardly sidestep two potential issues, namely overfitting and ZeroDivisionError.

For the former, we want to avoid scenarios where variant sites with very low read coverage (and thus few informative observations) are assumed to be fully representative of the true variation. In this case, smoothing effectively inserts additional observations that were not actually present in the provided reads to attempt to make up for potentially unseen observations.

For the latter case, consider the conditional probability of symbol a at position i occurring, given some symbol b at position j. This is defined (before smoothing is applied) by the Hansel framework as:

\frac{\text{Reads featuring} ~a~ \text{at} ~i~ \text{and} ~b~ \text{at} ~j}{\text{Reads spanning} ~i~ \text{and featuring} ~b~ \text{at} ~j}

If one were to query the Hansel API with some selection of i, b and j such that there are no reads spanning i that feature symbol b at position j, a ZeroDivisionError will be raised. This is undesirable and cannot be circumvented by simply “catching” the error and returning 0, as the inclusion of a probability of 0 in a sequence of products renders the entire sequence of probabilities as 0, too.

The current smoothing method is merely “add one smoothing”, which modifies the above equation to artificially insert one (more) observation for each possible combination of symbols a and b between i and j. This avoids division by zero as there will always be at least one valid observation. However I suspect that to truly address the former problem, a more sophisticated solution is necessary.
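
As a sketch (assuming `counts` is any callable returning the raw pairwise observation count; this mirrors the equation above rather than Hansel's internals):

```python
def smoothed_conditional(counts, a, i, b, j, alphabet="ACGT"):
    """Add-one smoothed P(a at i | b at j): one artificial observation is added for
    every possible symbol at i co-occurring with b at j, so the denominator is never zero."""
    numerator = counts(a, i, b, j) + 1
    denominator = sum(counts(x, i, b, j) for x in alphabet) + len(alphabet)
    return numerator / denominator

# With no observations at all, every symbol is equally (un)likely
print(smoothed_conditional(lambda a, i, b, j: 0, "A", 5, "T", 9))  # 0.25
```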

Fortunately, it appears that in practice, for metagenomes with reasonable coverage, the first problem falls away and smoothing has a negligible effect on evaluation of edge probabilities. Despite this, the method of smoothing employed is admittedly naive and future work could potentially benefit from replacement. It should be noted that the influence of smoothing has the potential to become particularly pronounced after significant reweighting of the Hansel matrix (i.e. when very few observations remain).

Evaluation

Avid readers will be aware that evaluation of this method has been a persistent thorn in my side since the very beginning. There are no metagenomic test data sets that offer raw sequence reads, an assembly of those reads and a set of all expected (or indeed, even just some) haplotypes for a gene known to truly exist in that metagenome. Clearly this makes evaluation of a method for enumerating the most likely haplotypes of a particular gene from sequenced metagenomic reads somewhat difficult.

You might remember from my last status report that generating my own test data sets for real genes was both convoluted, and fraught with inconsistencies that made it difficult to determine whether unrecovered variants were a result of my approach, or an artefact of the data itself. I decided to take a step back and consider a simpler form of the problem, in the hope that I could construct an adequate testing framework for future development, and provide an empirical proof that the approach works (as well as a platform to investigate the conditions under which it doesn’t).

Trivial Haplomes: Triviomes

To truly understand and diagnose the decisions my algorithm was making, I needed a source of reads that could be finely controlled and well-defined. I also needed the workflow to generate those reads to be free of uncontrolled external influences (i.e. reads dropped by alignment, difficult to predict SNP calling).

To accomplish this I created a script to construct trivial haplomes: sets of short, randomly generated haplotypes, each of the same fixed length. Every genomic position of such a trivial haplotype was considered to be a site of variation (i.e. a SNP). Tiny reads (of a configurable size range, set to 3-5bp for my work) are then constructed by sliding windows across each of the random haplotypes. Additional options exist to set a per-base error rate and “slip” the window (to reduce the quality of some read overlaps, decreasing the number of shared paired observations).
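
For flavour, here is a miniature version of that generator (the function name, parameters and defaults are made up here; the real script also emits the SAM and VCF described further down):

```python
import random

def make_triviome(n_haplotypes=3, length=12, read_sizes=(3, 5), error_rate=0.0, seed=42):
    """Random fixed-length haplotypes plus tiny sliding-window reads from each of them."""
    random.seed(seed)
    haplotypes = ["".join(random.choice("ACGT") for _ in range(length))
                  for _ in range(n_haplotypes)]
    reads = []
    for hap in haplotypes:
        for start in range(length):
            window = hap[start:start + random.randint(*read_sizes)]
            if len(window) < read_sizes[0]:
                continue  # too close to the end of the haplotype
            # Optionally corrupt bases to mimic a per-base error rate
            read = "".join(random.choice("ACGT") if random.random() < error_rate else base
                           for base in window)
            reads.append((start, read))
    return haplotypes, reads

haps, reads = make_triviome()
print(len(haps), "haplotypes,", len(reads), "tiny reads")
```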

This appears to be grossly unrepresentative of the real problem at hand (what technology even generates reads of 3-5bp?), but don’t forget: Hansel and Gretel are designed to work directly with SNPs. Aligned reads are only parsed at positions specified in the VCF (i.e. the list of called variants) and so real sequences collapse to a handful of SNPs anyway. The goal here is not so much to accurately emulate real reads with real variation and error, but to establish a framework for controlled performance testing under varying conditions (e.g. how do recovery rates change with respect to alignment rate, error rate, number of SNPs, haplotypes etc.).

We must also isolate the generation of input files (alignments and variant lists) from external processes. That is, we cannot use established tools for read alignment and variant calling. Whilst the results of these processes are tractable and deterministic, they confound the testing of my triviomes due to their non-trivial behaviour. For example, my tiny reads have a known correct alignment: each read is yielded from some window with a defined start and end position on a given haplotype. However read aligners discard, clip and “incorrectly” align these tiny test reads3. My reads are no longer under my direct control.

In the same fashion, despite my intention that every genomic position of my trivial haplome is a SNP, established variant callers and their diploid assumptions can simply ignore, or warp the calling of, tri- or tetra-allelic positions.

Thus my script is responsible for generating a SAM file, describing the alignment of the generated reads against a reference that does not exist, and a VCF, which simply enumerates all genomic positions on the triviome as a potential variant.
It’s important to note here that for Gretel (and recovery in general) the actual content of the reference sequence is irrelevant: the job of the reference (or pseudo-reference, as I have taken to call it for metagenomes) is to provide a shared co-ordinate system for the sequenced reads via alignment. In this case, the co-ordinates of the reads are known (we generated them!) and so the process of alignment is redundant and the reference need not exist at all.

Indeed, this framework has proved valuable. A harness script now rapidly and repeatedly generates hundreds of triviomes. Each test creates a number of haplotypes, with some number of SNPs. The harness can also specify an error rate, and how to sample the reads from each generated haplotype (uniformly, exponentially etc.). The resulting read alignment and variant list is thrown at Hansel and Gretel as part of the harness. Haplotypes recovered and output by Gretel can then be compared to the randomly generated input sequences for accuracy by merely constructing a matrix of Hamming distances (i.e. how many bases do not match for each output-input pair?). This is simple, fast and pretty effective, even if I do say so myself.
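
The comparison itself really is as dumb as it sounds; a sketch:

```python
def hamming(a, b):
    """Number of mismatching positions between two equal-length sequences."""
    return sum(x != y for x, y in zip(a, b))

def distance_matrix(recovered, inputs):
    """Hamming distance for every recovered-haplotype / input-haplotype pair."""
    return [[hamming(out, inp) for inp in inputs] for out in recovered]

print(distance_matrix(["ACGTA", "ACCTA"], ["ACGTA", "ACCTT"]))  # [[0, 2], [1, 1]]
```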

Despite its simplicity, this framework forms a basis for testing the packages during development, as well as giving us a platform on which to investigate the influence on recovery rate that parameters such as read length, number of haplotypes, number of SNPs, error rate, alignment rate have. Neat!

Synthetic Metagenomes

Of course, this triviome stuff is all well and good, but we need to prove that recovery is also possible on real genomic data. So are we still left executing the rather convoluted-looking workflow that I left you with towards the end of my last report?

On the surface, that appears to be the case. Indeed, we must still simulate reads from some set of related genes, align those reads to some pseudo-reference, call for SNPs on that alignment and use the alignment and those called SNPs as the input to Gretel to recover the original genes that we generated reads from in the first place. Of course, we must also compare the haplotypes returned by Gretel to the known genes to evaluate the results.

But actually, the difficulty in the existing workflow is in the evaluation. Currently we use an alignment step to determine where each input gene maps to the selected pseudo-reference. This alignment is independent from the alignment of generated reads to the pseudo-reference. The hit table that describes how each input gene maps to the master is actually parsed as part of the evaluation step. To compare the input sequences against the recovered haplotypes, we need to know which parts of the input sequence actually overlap the recovered sequences (which share the same co-ordinates as the pseudo-reference), and where the start and end of that overlapping region exists on that particular input. The hit table effectively allows us to convert the co-ordinates of the recovered haplotypes, to those of the input gene. We can then compare bases by position and count how many match (and how many don’t).

Unsurprisingly, this got messy quite quickly, and the situation was exacerbated by subtle disagreements between the alignments of genes to the reference with BLAST and reads to the reference with bowtie2. This caused quite a bit of pain and ultimately ended with me manually inspecting sequences and writing my own BLAST-style hit tables to correct the evaluation process.

One afternoon whilst staring at Tablet and pondering my life choices, I wondered why I was even comparing the input and output sequences in this way. We’re effectively performing a really poor local alignment search between the input and output sequences. Using an aligner such as BLAST to compare the recovered haplotypes to the input sequences seems to be a rather intuitive idea. So why don’t we just do an actual local alignment?

Without a good answer, I tore my haplotype evaluation logic out of Gretel and put it in a metaphorical skip. Now we’ve dramatically simplified the process for generating and evaluating data sets. Hooray.

A bunch of small data sets now exist at github.com/SamStudio8/gretel-test and a framework of questionable bash scripts make the creation of new data sets rather trivial. Wonderful.

Results

So does this all actually work? Without wanting to step on the toes of my next blog post, it seems to, yes.

Accuracy across the triviome harness in general, is very good. Trivial haplotypes with up to 250 SNPs can be recovered in full, even in haplomes consisting of reads from 10 distinct, randomly generated haplotypes. Unsurprisingly, we’ve confirmed that increasing the number of haplotypes, and the number of SNPs on those haplotypes makes recovery more difficult.

To investigate recovery from metahaplomes of real genes, I’ve followed my previous protocol: select some gene (I’ve chosen DHFR and AIMP1), use BLAST to locate five arbitrary but similar genes from the same family, break them into reads and feed the alignment and called SNPs to Gretel with the goal of recovering the original genes. For both the DHFR and AIMP1 data sets, it is possible to get recovery rates of 95-100% for genes that look similar to the pseudo-reference and 80+% for those that are more dissimilar.

The relationship between pseudo-reference similarity and haplotype recovery rates might appear discouraging at first, but after digging around for the reasoning behind this result, it turns out not to be Gretel‘s fault. Reads generated from sequences that have less identity to the pseudo-reference are less likely to align to that pseudo-reference, and are more likely to be discarded. bowtie2 denies Gretel access to critical evidence required to accurately recover these less similar haplotypes.

This finding echoes an overarching theme I have encountered with current genomic tools and pipelines: not only are our current protocols and software not suitable for the analysis of metagenomes, but their underlying assumptions of diploidy are actually detrimental to the analyses we are conducting.

Introducing our Preprint

My work on everything so far has culminated in the production of a preprint: Advances in the recovery of haplotypes from the metagenome. It’s quite humbling to see the sum total of around 18 months of my life summed into a document. Flicking through it a few months after it went online, I still get a warm fuzzy feeling: it looks like real science! I provide a proper definition of the metahaplome and introduce both the underlying graph theory for Hansel and the probability equations for Gretel. It also goes into a whole heap of results obtained so far, some historical background into the problem and where we are now as a field, and an insight into how this approach is different from other methodologies.

Our work is an advance in computational methods for extracting exciting exploitable enzymes from metagenomes.


Conclusion

Things appear to be going well. Next steps are to get in the lab and try Gretel out for real. We are still hunting around for some DNA in a freezer that has a corresponding set of good quality sequenced reads.

In terms of development for Hansel and Gretel, there is still plenty of room for future work (helpfully outlined by my preprint) but in particular I still need to come up with a good way to incorporate quality scores and paired end information. I expect both attributes will improve performance more than the pretty awesome results we are getting already.

For heavier refactoring, I’d like to also look at replacing Gretel's inherent greedy bias (the best choice is always the edge with the highest probability) with something a little more clever. Additionally, the handling of indels is almost certainly going to become the next thorn in my side.

I’d also like to come up with some way of visualising results (I suspect it will involve Circos), because everybody can get behind a pretty graph.


In other news

I went in the lab for the first time…

…it was not a success

Biologists on Reddit actually liked my sassy PCR protocol

Our sys admin left

This happened.

Someone is actually using Goldilocks

I am accidentally in charge of the department 3D printer

Writing a best man’s speech is definitely harder than an academic paper

I am a proper biologist now

Someone trusts me to play with lasers

I acquired some bees

I went up some mountains with a nice man from the internet

I went to a conference…

…I presented some work!

I accidentally went on holiday

…and now I am back.


tl;dr

  • Yes hello, I am still here and appear to be worse at both blog and unit testing
  • My work now has a name: a data structure called Hansel and an algorithm called Gretel
  • Reweighting appears to work in a robust fashion now that it is implemented correctly
  • Smoothing the conditional probabilities of observations should be looked at again
  • Triviomes provide a framework for investigating the effects of various aspects of metagenomes on the success of haplotype recovery
  • Generating synthetic metahaplomes for testing is significantly less of a pain in the ass now I have simplified the process for evaluating them
  • Current protocols and software are not suitable for the analysis of metagenomes (apart from mine)
  • I have a preprint!
  • Indels are almost certainly going to be the next pain in my ass
  • I shunned my responsibilities as a PhD student for a month and travelled Eastern Europe enjoying the sun and telling strangers about how awful bioinformatics is
  • I am now back telling North Western Europeans how awful bioinformatics is

  1. Mhaplotyper? Oh yes, the M is for metagenomic, and it is also silent. 
  2. Having written this paragraph, I wonder what the real impact of this intuition actually is. I’ve now opened an issue on my own repo to track my investigation. In the case where more than L pairs of evidence exist, is there quantifiable loss in accuracy by only considering L pairs of evidence? In the case where fewer than L pairs of evidence exist, does the smoothing have a non-negligible effect on performance?
  3. Of course, the concept of alignment in a triviome is somewhat undefined as we have no “reference” sequence anyway. Although one could select one of the random haplotypes as a pseudo-reference against which to align all tiny reads, considering the very short read lengths and the low sequence identity between the randomly generated haplotype sequences, it is highly likely that the majority of reads will fail to align correctly (if at all) to such a reference. 
Status Report: May 2016 (Metahaplomes: The graph that isn’t) https://samnicholls.net/2016/06/12/status-may16/ https://samnicholls.net/2016/06/12/status-may16/#respond Sun, 12 Jun 2016 20:03:24 +0000 https://samnicholls.net/?p=699 It would seem that a sufficient amount of time has passed since my previous report to discuss how everything has broken in the meantime. You would have left off with a version of me who had not long solidified the concept of the metahaplome: a graph-inspired representation of the variation observed across aligned reads from a sequenced metagenome. Where am I now?

Metahaplomes

The graph that isn’t

At the end of my first year, I returned from my reconnaissance mission to extract data mining knowledge from a leading Belgian university with a prototype for a data structure that was fit to house a metahaplome: a probabilistically weighted graph that can be traversed to extract likely sequences of variants on some gene of interest. I say graph, because the structure and API do not look or particularly act like a traditional graph at all. Indeed, the current representation is a four-dimensional matrix that stores the number of observations of a symbol (SNP) A at position i, co-occurring with a symbol B at position j.

This has proved problematic as I’ve had difficulty in explaining the significance of this to people who dare to ask what my project is about. “What do you mean it’s not a graph? There’s a picture of a graph on your poster right there!?”. Yes, the matrix can be exploited to build a simple graph representation, but not without some information loss. As a valid gene must select a variant at each site, one cannot draw a graph that contains edges from sites of polymorphisms that are not adjacent (as a path that traverses such an edge would skip a variant site1). We therefore lose the ability to encode any information regarding co-occurrence of non-adjacent variants (abs(i - j) != 1) if we depict the problem with a simple graph alone.

To circumvent this, edges are not weighted upfront. Instead, to take advantage of the evidence available, the graph is dynamically weighted during traversal (the movement to the next node is variable, and depends on the nodes that have been visited already) using the elements stored in the matrix.

Thus we have a data structure capable of being utilised like a graph, with some caveats: it is not possible to enumerate all possibilities or assign weights to all edges upfront before traversal (or for that matter, a random edge), and a fog of war exists during any traversal (i.e. it is not possible to predict where a path may end without exploring). Essentially we have no idea what the graph looks like, until we explore it. Despite this, my solution fuses the advantage of a graph’s simple representation, with the advantage of an adjacency matrix that permits storage of all pertinent information. Finally, I’ve been able to describe the structure and algorithm verbally and mathematically.
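To make that a little more concrete, here is a minimal sketch of the kind of structure being described. It is not the real implementation and the names are hypothetical; in particular, combining the evidence as a plain sum over the sites visited so far is a simplification of the conditional probabilities actually used.

  import numpy as np

  SYMBOLS = {"A": 0, "C": 1, "G": 2, "T": 3}

  class PairwiseMatrix:
      """Counts of symbol A at site i co-occurring with symbol B at site j."""

      def __init__(self, n_sites):
          self.obs = np.zeros((4, 4, n_sites, n_sites))

      def add_observation(self, sym_i, sym_j, i, j):
          # A read covering variant sites i and j contributes one observation
          self.obs[SYMBOLS[sym_i], SYMBOLS[sym_j], i, j] += 1

      def dynamic_weight(self, path, candidate, j):
          # Weight the "edge" to `candidate` at site j using all of the evidence
          # provided by the symbols already chosen along the path, adjacent and
          # non-adjacent alike; this is why the graph cannot be weighted upfront
          return sum(self.obs[SYMBOLS[s], SYMBOLS[candidate], i, j]
                     for i, s in enumerate(path))

Traversal then amounts to repeatedly asking for the weight of each possible next symbol given the path so far, and picking one.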

Reweighting

Of course, having this traversable structure that takes all the evidence seen across the reads into account is great, but we need a reliable method for rescuing more than just one possible gene variant from the target metahaplome. My initial attempts at this involved invoking stochastic jitter during traversal to quite poor effect. It was not until some time after I’d got back from putting mayonnaise on everything that I considered altering the observation matrix that backs the graph itself to achieve this.

My previous report described the current methodology: given a complete path, check the marginal probability for each variant at each position of the path (i.e. the probability one would select the same nucleotide if you were to look at that variant site in isolation) and determine the smallest marginal. Then iterate over the path, down-weighting the element of the observation matrix that stores the number of co-occurrences of the i’th selected nucleotide and the i+1’th selected nucleotide, by multiplying the existing value by the lowest marginal (which will be greater than 0, but smaller than 1) and subtracting that product from the current count.

Initial testing yielded more accurate results with this method than anything I had tried previously, where accuracy is quantified by this not happening:

The algorithm is evaluated with a data set of several known genes from which a metagenome is simulated. The coloured lines on the chart above refer to each known input gene. The y axis represents the percentage of variants that are “recovered” from the metagenome, the x axis is the iteration (or path) number. In this example, a questionable strategy caused poor performance (other than the 100% recovery of the blue gene), and a bug in handling elements that are reweighted below 1 allowed the algorithm to enter a periodic state.

After implementing the latest strategy, performance compared to the above increased significantly (at least on the limited data sets I have spent the time curating), but I was still not entirely satisfied. Recognising this was going to take much more time and thought, I procrastinated by writing up the technical aspects of my work in excruciating mathematical detail in preparation for my next paper. To wrap my head around my own equations, I commandeered the large whiteboards in the undergraduate computing room and primed myself with coffee and Galantis. Unfortunately, after an hour or two of excited scribbling, this happened:

I encountered an oversight. Bluntly:

Despite waxing lyrical about the importance of evidence arising from non-adjacent variant sites, I’d overlooked them in the reweighting process. Although frustrated with my own incompetence, this issue was uncovered at a somewhat opportune time, as I was looking for a likely explanation for what felt like an upper bound on the performance of the algorithm. As evidence (observation counts) for adjacent pairwise variants was decreased through reweighting, non-adjacent evidence was becoming an increasingly important factor in the decision making process for path traversal, simply by virtue of those counts being larger (as they were left untouched). Thus paths were still being heavily coerced along particular routes and were not afforded the opportunity to explore more of the graph, yielding less accurate results (fewer recovered variants) for more divergent input genes.

As usual with these critical oversights, the fix was trivial (just apply the same reweighting rules to the non-adjacent pairs of variants as to the adjacent ones), and indeed, performance was bumped by around 5 percentage points. Hooray.
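In code, the corrected reweighting step looks something like the sketch below (reusing the shape of the hypothetical pairwise count matrix from the earlier snippet; again, this is an illustration rather than the actual implementation):

  SYMBOLS = {"A": 0, "C": 1, "G": 2, "T": 3}

  def reweight_path(obs, path, marginals):
      # obs is the 4D pairwise count matrix; marginals[i] is the marginal
      # probability of the symbol chosen at site i, considered in isolation
      ratio = min(marginals)  # smallest marginal along the path, 0 < ratio < 1
      for i, sym_i in enumerate(path):
          for j, sym_j in enumerate(path):
              if i == j:
                  continue
              a, b = SYMBOLS[sym_i], SYMBOLS[sym_j]
              # Remove a fraction of the supporting evidence (equivalent to
              # multiplying the count by 1 - ratio), for adjacent and
              # non-adjacent pairs alike; clamp at zero so a depleted count
              # cannot go negative and trap the traversal in a cycle
              obs[a, b, i, j] = max(0.0, obs[a, b, i, j] * (1.0 - ratio))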

Evaluation

Generating test data (is still a pain in the arse)

So here we are, I’m still somewhat stuck in a data rut. Generating data sets (that can be verified) is a somewhat convoluted procedure. Whilst to run the algorithm all one needs is a BAM of aligned reads and an associated VCF of called SNP sites, to empirically test the output we also need to know what the output genes should look like. Currently this requires a “master” FASTA (the origin gene), a FASTA of similar genes (the ones we actually want to recover) and a blast hit table that documents how those similar genes align to the master. The workflow for generating and testing a data set looks like this (a sketch of the final evaluation step follows the list):

  • Select an interesting, arbitrary master gene from a database (master.fa)
  • blast for similar genes and select several hits with decreasing identity
  • Download FASTA (genes.fa) and associated blast hit table (hits.txt) for selected genes
  • Simulate reads by shredding genes.fa (reads.fq)
  • Align reads (reads.fq) with bowtie to pseudo-reference (master.fa) to create (hoot.bam)
  • Call SNPs on (hoot.bam) to create a VCF (hoot.vcf)
  • Construct metahaplome and traverse paths with reads (hoot.bam) and SNPs (hoot.vcf)
  • Output potential genes (out.fa)
  • Evaluate each result in out.fa against each hit in hits.txt
    • Extract DNA between subject start and end for record from genes.fa
    • Determine segment of output (from out.fa) overlapping current hit (from genes.fa)
    • Convert the co-ordinates of the SNP to the current hit (genes.fa)
    • Confirm consistency between the SNP on the output and the corresponding base on the current hit
    • Return matrix of consistency for each output gene, to each input gene
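As promised, the final evaluation steps, translating co-ordinates through the blast hit table, look roughly like the sketch below. The field names, and the assumption of ungapped, forward-strand hits, are mine rather than those of the real evaluation code.

  def check_snp(snp_pos_on_master, recovered_base, hit, hit_sequence):
      # hit holds blast outfmt 6 style co-ordinates (1-based, inclusive):
      # qstart/qend on the master, sstart on the subject (the input gene)
      if not (hit["qstart"] <= snp_pos_on_master <= hit["qend"]):
          return None  # this SNP falls outside the hit, nothing to compare
      # Translate the master co-ordinate onto the input gene
      pos_on_hit = hit["sstart"] + (snp_pos_on_master - hit["qstart"])
      expected_base = hit_sequence[pos_on_hit - 1]  # strings are 0-indexed
      return recovered_base.upper() == expected_base.upper()

Doing this for every recovered SNP, for every output gene against every hit, gives the matrix of consistency in the final step.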

Discordant alignments between discordant aligners

The issue at hand primarily arises from discordant alignment decisions between the two alignment processes that make up components of the pipeline: blast and bowtie. Although blast is used to select the initial genes (given some “master”), and its resulting hit table is also used to evaluate the approach at the end of the algorithm, bowtie is used to align the reads (reads.fq) to that same master. Occasional disagreements between the two algorithms are inevitable on real data, but I assumed that, given the simplicity of the data (simulated reads of uniform length, no errors, reasonable identity), they would behave the same. It may sound like an obvious problem source, but when several genes are reported as correctly extracted with high accuracy (identity) and one or two are not, you might forgive me for thinking that the algorithm just needed tweaking, rather than there being an underlying problem stemming from the alignment steps! This led to much more tail chasing than I would care to admit to.

For one example: I investigated a poorly reconstructed gene by visualising the input BAM produced by bowtie with Tablet (a rather nifty BAM+VCF+FA viewer). It turned out that for input reads belonging to one of the input genes, bowtie had called an indel1, causing a disagreement as to what the empirical base at each SNP following that indel should have been. That is, although all of the reads from that particular input gene were aligned by bowtie as having an indel (and thus shifting the bases in those reads), and were processed by my algorithm with that indel taken into account, at the point of evaluation the blast hit table is the gold standard; what may have been the correct variant (indel notwithstanding) would be determined as incorrect by the alignment of the hit table.

I suppose the solution might be to switch to one aligner, but I’m aware that even the same aligner can make different decisions under differing conditions (read length).
It’s important to note that currently the hit table is also used to define where to begin shredding reads for the simulated metagenome, which in turn causes trouble if bowtie disagrees with where an alignment begins and ends. I’ve had cases where blast aligns the first base of the subject (input gene) to the first base of the query (master) but on inspection with Tablet, it becomes evident that bowtie clips the first few bases when aligning the opening reads to the master. This problem is a little more subtle, and in current practice causes little trouble. Although the effect would be a reduction in observed evidence for variants at SNPs that happen to occur within the first few bases of the gene, my test sets so far do not have a SNP site so close to the start of the master. This is obviously something to watch out for, though.

At this time I’ve just been manually altering the hit table to reconcile differences between the two aligners, which is gross.

Bugs in the machine

Of course, a status report from me is not complete without some paragraph where I hold my hands up in the air and say everything was broken due to my own incompetence. Indeed, a pair of off-by-one bugs in my evaluation algorithm also warped the reported results. The first, a regression introduced after altering the parameters under which the evaluation function determines an overlap between the current output gene and the current hit, led (under infrequent circumstances) to a miscalculation when translating the base position on the master to the base position on the expected input, causing the incorrect base to be compared to the expected output. This was also accidentally fixed when I refactored the code, which saw a very small increase in performance.

The second, an off-by-one error in the reporting of a new metric: “deviations from reference”, caused the results to suddenly appear rather unimpressive. The metric measures the number of bases that differ from the pseudo-reference (master.fa) but were correctly recovered by my algorithm to match an original gene from genes.fa. Running my algorithm now yielded a results table describing impressive gene recovery scores (>89%), but those genes appeared to differ from the reference by merely a few SNPs (<10). How could we suck at recovering sequences that barely deviate from the master? Why does it take so many iterations? After getting off the floor and picking up the shattered pieces of my ego and PhD, I checked the VCF and confirmed there were over a hundred SNPs across all the genes. Curious, I inspected the genes manually with Tablet to see how they compared to the reference. Indeed, there were definitely more than the four deviations reported for one particular case, so what was going on?

To finish quickly: path iteration numbers start from 0, but are reported to the user as iter + 1, because the 0’th iteration is not catchy. My mistake was using iter + 1 to also access the number of deviations from the reference detected in the current iteration – in a zero-indexed structure. I was fetching the number of deviations successfully extracted by the path after this one, which we would expect to be poor, as the structure would have been reweighted to prevent that path from appearing again. Nice work, me. This fix made things a little more interesting:

More testing of the testing is evidently necessary.
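For the record, the bug boils down to something like this contrived illustration (not the real code):

  deviations = [12, 9, 7, 4]  # deviations from reference per path, zero-indexed
  for it in range(len(deviations) - 1):
      # Reporting it + 1 to the user is fine; using it as an index silently
      # reports the following (reweighted, and so much poorer) path instead
      print("Path %d: %d deviations (should be %d)"
            % (it + 1, deviations[it + 1], deviations[it]))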

Conclusion

So where does that leave us? Performance is up, primarily because the code that I wrote to evaluate performance (and reweight) is now less broken. Generating data sets is still a pain in the arse, but I have got the hang of the manual process involved, so I can at least stop hiding from the work to be done. It might be worth investigating consolidating all of my alignment activities into one aligner to improve my credit score. Results are looking promising: this algorithm is now capable of extracting genes (almost or entirely whole) from simulated (albeit quite simple) metagenomes.

Next steps will be more testing, writing the method itself as a paper, and getting some proper biological evidence from the lab that this work can do what I tell people it can do.

In other news


tl;dr

  • I continue to be alive and bad at both blog and implementing experimental data structures
  • I fixed my program not working by fixing the thing that told me it wasn’t working
  • If your evaluator has holes in, you’ll spend weeks chasing problems that don’t exist
  • Never assume how someone else’s software will work, especially if you are assuming it will work like a different piece of software that you are already making assumptions about
  • Always be testing (especially testing the testing)
  • This thing actually fucking works
  • An unpleasant side effect of doing a PhD is the rate of observed existential crises increases
  • Life continues to be a series of off-by-one errors, punctuated with occasional trips to the seaside

  1. Let’s not even talk about indels for now. 
Teaching children how to be a sequence aligner with Lego at Science Week https://samnicholls.net/2016/03/29/abersciweek16/ https://samnicholls.net/2016/03/29/abersciweek16/#respond Tue, 29 Mar 2016 22:59:46 +0000 https://samnicholls.net/?p=612 As part of a PhD it is anticipated1 that you will share your science with various audiences; fellow PhD students, peers in the field and the various publics. Every year, the university celebrates British Science Week with a Science Fair, inviting possibly the most difficult public to engage with: children. Over three days the fair serves to educate and entertain 1700 pupils from over 30 schools based across Mid Wales, and this year I volunteered2 to run a stand.

How to explain assembly?

I was inspired by Amanda’s activity for prospective students at a visiting day a few weeks prior. To describe the problem of DNA sequence assembly and alignment in a friendly (and quick) way, Amanda had hundreds of small pieces of paper representing DNA reads. The read set was generated with Titus Brown’s shotgunator tool, slicing a few sentences about the problem (meta!) into k-mers, with a few errors and omissions for good measure. Visitors were asked to help us assemble the original sequence (the sentences) by exploiting the overlaps between reads.

I like this activity as it gives a reasonable intuition for how assembly of genomes works, using just scraps of paper. Key is that the DNA is abstracted into something more tangible to newcomers – English words building sentences – which is far simpler to explain and understand, especially in a short time. It’s also quite easy to describe some of the more complicated issues of assembly, namely errors and repeats via misspellings and repeated words or phrases.
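For anyone wanting to recreate the paper version, the shredding step amounts to something like this toy snippet (not the shotgunator tool itself; the sentence and parameters are made up):

  def shred(sentence, k=15, step=5):
      # Slide a window of length k along the sentence to make overlapping "reads"
      return [sentence[i:i + k] for i in range(0, len(sentence) - k + 1, step)]

  for read in shred("DNA ASSEMBLY IS JUST OVERLAPPING LOTS OF SHORT STRINGS"):
      print(read)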

A problem with pigeonholing college students?

Yet to my surprise, the majority of the compscis-to-be were quite apprehensive about taking on the task at the mere mention of this being a biological problem, despite the fact that sequence alignment can be easily framed as a text manipulation problem. Their apprehension only increased when introduced to Amanda’s genome game: a fun web-based game that generates a small population with a short binary genome whose rules must be guessed before the time runs out. A few puzzled visitors offered various flavours of “…but I’m not here to do biology!”, and one participant backed out of playing with “…but biology is scary and too hard!”. In general the activities had a reasonable reception, but visitors appeared more interested in the Arduinos, web games and robots – their comfort zone, presumably.

One need not necessarily be an expert in biology (I’m certainly not) to be able to contribute to the study of computationally framed questions in that field. As mentioned, DNA alignment is effectively string manipulation and those strings could be anything! Indeed this is even demonstrated by our activity using English sentences rather than the alphabet ACGT.

From experience, undergraduates (and apparently college students) appear keen to pigeonhole themselves early (“…dammit Jim I’m a computer scientist not a bioinformatician”) via their prior beliefs about the meaning of “computing”, and their module/A-level choices. I think it is at this stage that subjects outside one’s choices become “scary” and fall outside one’s scope of interest – “…if I wanted to learn biology why would I be doing compsci?”. Yet most jobs from finance to game development will require some domain-specific knowledge and reading outside computing, whether it’s economics, physics or even art and soundscape design.

This is why it is important as a computer science department that we introduce undergraduates to other potential applications of the field. It’s not that we should push students to study bioinformatics over robotics, but that many students can easily go on unaware that computing can be widely applicable to research endeavours in different fields in the first place. To combat the “this is not my area” issue, many assignments in our department have a real-world element, often just tidbits of domain-specific knowledge that force students to recognise the need for a basic understanding of something outside of their comfort zone.

Lego: a unicorn-like universal engagement tool

College students aside, I needed to work out how to engage schoolchildren between the ages of 10-12 with this activity. Scraps of paper would be unlikely to hold the attention of my target age group for long. I needed something more tangible and less fiddly than strips of paper. It was while describing the problem of introducing these “building blocks of nature” to kids in a simple way that the perfect metaphor popped into mind: Lego.

Yes! A 2×2 brick can represent an individual nucleotide, and we can use different coloured bricks to colour code the four nucleotides (and maybe another for “missing” if we’re feeling mean). A small stack of bricks builds a short string of DNA to represent a read. The colour code effectively abstracts away the potentially-confusing ACGT alphabet, making the alignment game easier to play (matching just colours, rather than symbols that need parsing first) and also quite aesthetically pleasing.

The hard part was sourcing enough Lego. I returned to my parents’ home to dig through my childhood and retrieve years’ worth of collected pieces, but once back in Aberystwyth I was surprised to find that after sorting through two whole boxes I did not own more than some 100 2×2 bricks (and most were not in colours I wanted). Bricks, it appears, are actually quite hard to come by! I put out a request for help on the Aber Comp Sci Facebook group and a lecturer kindly performed the same sort with his children’s collections. Their collection must have been more substantial and yielded 150-200 bricks in a mix of four colours, saving my stand.

The setup

The activity itself is simple and needs nothing other than some patter, the Lego and a surface for kids to align the pieces on. I spent more time than I would like to admit covering a cardboard box with tinfoil to create the SAMTECH SEQUENCER 9000 (described by Illumina as “shiny”), a prop to contextualise the problem: we can’t look at whole genomes, only short pieces of them that need assembly.

[Photo: the tinfoil-covered SAMTECH SEQUENCER 9000]

Of course, we’d need some read sets. To make these, I divided the available bricks into two piles; Nathan and I then each ad-libbed sliding k-mers of length 5 (i.e. consecutive stacks would overlap by 4, 3, 2 or 1 coloured bricks – and those overlapping stacks had overlaps of their own…) to build up an arbitrary genome to recover. Simple!

Running the activity

Once doors opened, there was no shortage of children wanting to try out the stand. I think the mystery of the tinfoil box and the allure of playing with Lego was enough to grab attention, though Nathan (my lovely assistant) and I would flag down passers-by if the table was free. Pupils were encouraged to visit as many activities as possible by means of a questionnaire, on which each stand posed a scientific question that could be answered by completing that particular stand’s activity. Unfortunately for us, our stand’s question was not included on the questionnaire (I guess we submitted it too late) but luckily, we found pupils were keen to write down and find an answer to our “bonus question” after all.

We quickly developed a double-act routine; opening by quizzing our aligners on what they knew about DNA, which was typically not much, though it was nice to hear that the majority were aware that “it’s inside us”. Interestingly, among the pupils who did know what DNA was, their exposure was primarily from television – specifically its use in identifying criminals. Nathan would then explain that if we wanted to look at somebody’s DNA, we would take a sample from them and process it with the shiny tinfoil sequencer. This special machine would apply some magic science and produce short DNA reads that had to be pieced back together to recover the whole genome.

At this point we’d invite participants to open the lid of the sequencer and take out a batch of reads (of a possible two sets) for assembly. We’d explain the rules and show some examples of a correct alignment: sequences of matching runs of colour between two or more Lego stacks. Once they got the hang of it, we’d leave them to it for a little while. The two sets meant that we could split larger groups into pairs or triplets to ensure that everybody had a chance to make some successful alignments.

As the teams finished aligning the most obvious motifs (Nathan and I had both accidentally made a few triplets of colours in our read sets that resembled well-known flags – which was handy), progress would begin to slow and a few more difficult or red-herring reads would be left over, and Nathan or I would start narrating the problem, asking teams if this had been more difficult than expected. I don’t think any team agreed that the activity had been easy! We used this as an opportunity to interrupt the game to frame how complicated assembly is for real sequences and reveal the answer to our question.

The debrief

This was my favourite part. I’d hold up one of the Lego stacks and pull it apart. “Each of these bricks is a single base; stacked together they make this read, which tells us what a small part of a much longer genome looks like”. I’d then ask how long they imagined a whole human genome might be. Answers most frequently ranged between 100 and 1,000; a minority guessed between 4 and 15. No pupil ventured guesses beyond a million. For the very small guesses, I’d assemble a Lego stack of that length and ask if they still thought the differences between us all could be explained by such a short genome – nobody changed their mind3.

The look on their faces when I revealed it was actually three billion made the entire activity worth it. If we had enough Lego to build a genome, it would be 28,800km tall and stretch into space far beyond where global positioning satellites are in orbit. I’d explain that when we do this for real, the stacks aren’t five bases long, but more like a hundred, and instead of the handful of reads we had in our tinfoil sequencer, there are millions of reads to align and assemble. They’d gasp and look around at each other’s faces, equally stunned. We even had some teachers dumbfounded by this reveal. “This is why computers are now so important in biology, this would be impossible otherwise!”. We’d clear up any last questions or confusions and thank them for playing.
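(For the sceptical: a standard Lego brick is 9.6mm tall, so the back-of-the-envelope check for that figure is simply the following.)

  brick_height_mm = 9.6            # height of a standard Lego brick
  bases_in_genome = 3_000_000_000
  print(bases_in_genome * brick_height_mm / 1e6, "km")  # 28800.0 km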

Some observations

I would not consider our first group a roaring success. I was not ready for how difficult assembly of a set of unique 5-mers would be. The group had significant trouble recovering the genome and, as it turned out, Nathan and I did too. The situation had not been helped by the fact that the group had also taken a mix of reads from both batches in the tinfoil sequencer. As it turns out, even trivial assembly is really hard. I could tell the kids were somewhat disappointed and the difficulty of the game had hampered their enjoyment. We recovered by wowing them with facts about the human genome and they asked some good questions too. Once they left the table, Nathan began the patter with the next group as I hurriedly worked to reduce the number of red-herring reads and recycle the bricks to create duplicate reads, which allowed groups to make progress more quickly at the beginning (effectively turning the difficulty into a ramp, rather than leaving it uniformly hard). This improved further games considerably.

I was surprised how happy the pupils were to append our fairly long question to an already quite lengthy questionnaire, and how keen they were to find the answer, too. Not a single pupil was put off our activity by the mention of biology, DNA or even unfamiliar terminology like “sequencer”, or “read”. Fascinatingly, Amanda also ran the aforementioned genome game and it was a hit. I guess primary school students are just open to a very wide definition of science and are yet to pigeonhole themselves? Activities like this at an early age have the potential to massively influence how our next generation of scientists see science: as a large collaborative effort in which skills can be transferred and shared to solve important and interesting questions. The pupils simply had no idea that computers could be used like this, for science, let alone for biologically inspired questions.

In general the activity went down very well; the kids seemed to get the concept very quickly and also understood the (albeit naive) parallel to DNA. I think they genuinely learned a thing or two (the human genome is big!) and enjoyed themselves. I’m pleased that we managed to draw and keep attention to our stand, given we were wedged between a bunch of old Atari consoles and a display of unmanned aerial vehicles.

I was definitely surprised at how much I enjoyed running the stand too. I’m not overly fond of children and was expecting to have to put on a brave face to deal with tiny, uninterested people in assorted bright sweaters all day. Yet all but one or two pupils were happy to be there, incredibly enthusiastic to learn, asked great questions (sometimes incredibly insightful questions) and genuinely had a nice time and thanked us for it. Enjoyment aside, I took the second day off, as I’d found running the activity over and over oddly draining.

Future activities

If I were to run this again, I’d like to make it a little more interactive and ideally give players a chance to actually use Lego for its intended purpose: building something. Thankfully at our stand, students were not particularly disappointed when our rules stated that they couldn’t take the reads apart, or put them together (i.e. couldn’t actually play with the Lego…). To improve, my idea would be to get participants to construct a short genome out of Lego pieces that can be truly “sequenced” by pushing it through some sort of colour sensor or camera apparatus attached to an Arduino inside a future iteration of the trusty SAMTECH Sequencer range. Some trivial software would then give the player some sort of monster to name4, print off and call their own.

To run the activity again in its current form, I think I’d need to have more Lego. However, it turns out that packs of 2×2 bricks in one colour are widely available on eBay and Amazon, though aren’t actually that much cheaper than ordering via the “Pick a Brick” service on the canonical Lego website. I’ve ordered a few packs (at an astonishing £0.12 per brick) as I would like to try and run this activity at other events to spread the sheer joy that bioinformatics can bring to one’s afternoon.

To give the current version of the game a little more of a goal, it would have been ideal to explain the concept of a genomic reference and have the players align the reads to that (as well as to each other); in effect this would have been like solving the edges of a jigsaw, giving a sense of quick progress (which means fun) and also affording us the opportunity to explain more of the “real science” behind the game. To make the game more difficult, we could have properly employed “missing bases” and the common issues that plague assembly, including repeats (which are easier to explain with a reference), as well as errors. After the first group at the Science Fair, I quickly removed the majority of sneaky errors as they made the game too “mean” (where Nathan or I had to explain “No, that one doesn’t go there!” too frequently).

Some proof that I did public engagement5

tl;dr

  • Actual Lego bricks are hard to come by (unless you just buy them)
  • Typical ten year olds are not as dumb or as apathetic to science as one might expect
  • Assembly is actually pretty hard
  • Engaging with children with science is exhausting but surprisingly rewarding
  • Acquire more Lego
  • It’s very hard to tinfoil a cardboard box nicely

  1. Read, required. 
  2. Read, was coerced. 
  3. With a single Lego brick in hand, one kid looked me dead in the eye and said “Yeah!” when asked if this single base could explain the differences between every human on Earth. 
  4. Genome McGenface? 
  5. Absolutely not using this to pass my public engagement module. 
Goldilocks: A tool for identifying genomic regions that are “just right” https://samnicholls.net/2016/03/08/goldilocks/ https://samnicholls.net/2016/03/08/goldilocks/#respond Tue, 08 Mar 2016 11:05:10 +0000 https://samnicholls.net/?p=608 I’m published! I’m a real scientist now! Goldilocks, my Python package for locating regions on a genome that are “just right” (for some user-provided definition of just right) is published software and you can check out the application note on Bioinformatics Advance Access, download the tool with pip install goldilocks, view the source on Github and read the documentation on readthedocs.
