Status Report: 2018: The light is at the end of the tunnel that I continue to build

   Sam Nicholls    AU-PhD, Status Report

Happy New Year!
The guilt of not writing has reached a level where I feel sufficiently obligated to draft a post. You’ll likely notice from the upcoming contents that I am still a PhD student, despite a previous, more optimistic version of myself writing that 2016 would be my final Christmas as a PhD candidate.

Much has happened since my previous Status Report, and I’m sure much of it will spin off into several posts of its own, eventually. For the sake of brevity, I’ll give a high-level overview.
I’m supposed to be writing a thesis anyway.


Previously on…

We last parted ways with a double-bill status report lamenting the troubles of generating suitable test data for my metagenomic haplotype recovery algorithm, and documenting the ups-and-downs-and-ups-again of analysing one of the synthetic data sets for my pre-print. In particular, I was on a quest to respond to our reviewer’s desire for more realistic data: real reads.

Gretel: Now with real reads!

Part Two of my previous report alluded to a Part Three that I never got around to finishing, on the creation and analysis of a test data set consisting of real reads. This was a major concern of the reviewers who gave feedback on our initial pre-print. Without getting into too much detail (I’m sure there’s time for that), I found a suitable data set consisting of real sequence reads from a lab-mix of five HIV strains, used to benchmark algorithms in the related problem of viral-quasispecies reconstruction. After fixing a small bug, and implementing deletion handling, it turns out we do well on this difficult problem. Very well.

In the same fashion as our synthetic DHFR metahaplome, this HIV data set provided five known haplotypes, representing five different HIV-1 strains. Importantly, we were also provided with real Illumina short-reads from a sequencing run containing a mix of the five known strains. This was our holy grail, finally: a benchmark with sequence reads and a set of known haplotypes. Gretel is capable of recovering long, highly variable genes with 100% accuracy. My favourite result is a recovery of env — the ridiculously hyper-variable envelope gene that encodes the HIV-1 virus’ protein shell — with Gretel correctly recovering all but one of 2,568 positions. Not bad.
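As a rough illustration of what that env figure means, here is a minimal sketch of scoring a recovery as the fraction of positions where the recovered haplotype agrees with the known one. This is not Gretel's actual evaluation code: the function name `recovery_rate` and the toy sequences are made up for the example, and it assumes the two sequences are already aligned to the same length.

```python
def recovery_rate(recovered: str, known: str) -> float:
    """Fraction of positions at which two equal-length sequences agree."""
    if len(recovered) != len(known):
        raise ValueError("sequences must be aligned to the same length")
    if not known:
        raise ValueError("sequences must not be empty")
    # Count position-by-position matches between the two sequences.
    matches = sum(r == k for r, k in zip(recovered, known))
    return matches / len(known)


# Toy stand-in for env (hypothetical data): 2,568 positions, one mismatch.
known = "A" * 2568
recovered = "A" * 2567 + "C"
rate = recovery_rate(recovered, known)
print(f"{rate:.2%}")
```

One wrong position out of 2,568 works out at roughly 99.96% of positions recovered correctly.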

A new pre-print

Armed with real reads, and improved results for our original DHFR test data (thanks to some fiddling with bowtie2), we released a new pre-print. The manuscript was a substantial improvement over its predecessor, which meant it was all the more disappointing to be rejected from five different journals. But, more on this misery at another time.

Despite our best efforts to address the previous concerns, new reviewers felt that our data sets were still not a good representation of the problem-at-hand: “Where is the metagenome?”. It felt like the goal-posts had moved: suddenly, real reads were not enough. But it’s both a frustrating and a fair response. Work should be empirically validated, but there are no metagenomic data sets with both a set of sequence reads and known haplotypes. So, it was time to make one.

I’m a real scientist now…

And so, I embarked upon what would become the most exciting and frustrating adventure of my PhD. My first experience of the lab as a computational biologist is a post sat in draft, but suffice to say that the learning curve was steep. I’ve discovered that there are many different types of water and that they all look the same, that 1 ml is a gigantic volume, that you’ll lose your fingerprints if you touch a metal drawer inside a -80°C freezer, and that contrary to what I might have thought before, transferring tiny volumes of colourless liquids between tiny tubes without fucking up a single thing takes a lot of time, effort and skill. I have a new appreciation for the intricate and stochastic nature of lab work, and I understand what it’s like for someone to “borrow” a reagent that you spent hours of your time making from scratch. And finally, I had a legitimate reason to wear the ill-fitting lab coat that I purchased in my first year (2010) to look cool at computer science socials.

With this new-found skill-tree to work on, I felt like I was becoming a proper interdisciplinary scientist, but this comes at a cost. Context switching isn’t cheap, and I was reminded of my undergraduate days where I juggled mathematics, statistics and computing to earn my joint honours degree. I had more lectures, more assignments and more exams than my peers, but this was and still is the cost of my decision to become an interdisciplinary scientist.

And it was often difficult to find much sympathy from either side of the Venn diagram…

…and science can be awful

I’ve suffered many frustrations as a programmer. One can waste hours tracking down a bug that turns out to be a simple typo or, more likely, one of the off-by-one errors that plague much of bioinformatics. I’ve felt the self-directed anger of having submitted thousands of cluster jobs that failed with a missing parameter, or of waiting hours for a program to complete, only to discover the disk has run out of room to store the output. Yet, these problems pale in comparison to problems at the bench.

I’ve spent days in the lab, setting-up and executing PCR, casting, loading and running gels, only to take a UV image of absolutely nothing at all.

Last year, I spent most of Christmas shepherding data through our cluster, much to my family’s dismay. This year, I had to miss a large family do for a sister’s milestone birthday. I spent many midnights in the lab, lamenting the life of a PhD student, and shuffling around with angry optimism: “Surely it has to fucking work this time?”. Until finally, I got what I wanted.

I screamed so loud with glee that security came to check on me. “I’m a fucking scientist now!”

New Nanopore Toys

My experiment was simple in principle. Computationally, I’d predicted haplotypes with my Gretel method from short-read Illumina data from a real rumen microbiome. Using the haplotypes, I designed 10 pairs of primers to capture 10 genes of interest (with hydrolytic activity). And finally, after several weeks of almost constant, 24/7 lab work, building cDNA libraries and amplifying the genes of interest, I made enough product for the exciting next step: Nanopore sequencing.

With some invaluable assistance from our resident Nanopore expert Arwyn Edwards (@arwynedwards) and PhD student André (@GeoMicroSoares), I sequenced my amplicons on an Oxford Nanopore MinION, and the results were incredible.

Our Nanopore reads strongly supported our haplotypes, and concurred with the Sanger sequencing. Finally, we have empirical biological evidence that Gretel works.

The pre-print rises

With this bombshell in the bag, the third version of my pre-print rose from the ashes of the second. We demoted the DHFR and HIV-1 data sets to the Supplement, and in their place included an analysis of our performance on a de facto benchmark mock community introduced by Chris Quince. The data sets and evaluation mechanisms that our previous reviewers found unrepresentative and convoluted were gone. I even got to include a Circos plot.

Once more, we substantially updated the manuscript, and released a new pre-print. We made our way to bioRxiv to much Twitter fanfare, earning over 1,500 views in our first week.

This work also addresses every piece of feedback we’ve had from reviewers in the past. Surely, the publishing process would now finally recognise our work and send us out for review, right?

Sadly, the journey of this work is still not smooth sailing, with three of my weekends marred by a Friday desk rejection…

…and a fourth desk rejection on the last working day before Christmas was pretty painful. But we are currently grateful to be in discussion with an editor and I am trying to remain hopeful we will get where we want to be in the end. Wish us luck!


In other news…

Of course, I am one for procrastination, and have been keeping busy while all this has been unfolding…

I hosted a national student conference

I am applying for some fellowships

I’ve officially started my thesis…

…which is just as well, because the money is gone

I’ve started making cheap lab tat with my best friend…

…it’s approved by polar bears

…and the UK Centre for Astrobiology

…and has been to the Arctic

I gave an invited talk at a big conference…

…it seemed to go down well

I hosted UKIEPC at Aber for the 4th year

We’ve applied to fund Monster Lab…

…and made a website to catalogue our monsters

For a change I chose my family over my PhD and had a fucking great Christmas


What’s next?

  • Get this fucking great paper off my desk and out of my life
  • Hopefully get invited to some fellowship interviews
  • Continue making cool stuff with Sam and Tom Industrys
  • Do more cool stuff with Monster Lab
  • Finish this fucking thesis so I can finally do something else

tl;dr

  • Happy New Year
  • For more information, please re-read