Status Report: 2018: The light is at the end of the tunnel that I continue to build
https://samnicholls.net/2018/01/15/status-jan18-p1/ (Mon, 15 Jan 2018)

Happy New Year!
The guilt of not writing has reached a level where I feel sufficiently obligated to draft a post. You’ll likely notice from the upcoming contents that I am still a PhD student, despite a previous, more optimistic version of myself writing that 2016 would be my final Christmas as a PhD candidate.

Much has happened since my previous Status Report, and I'm sure much of it will spin off into several posts of its own, eventually. For the sake of brevity, I'll give a high-level overview.
I’m supposed to be writing a thesis anyway.


Previously on…

We last parted ways with a double-bill status report lamenting the troubles of generating suitable test data for my metagenomic haplotype recovery algorithm, and documenting the ups-and-downs-and-ups-again of analysing one of the synthetic data sets for my pre-print. In particular, I was on a quest to respond to our reviewers' desire for more realistic data: real reads.

Gretel: Now with real reads!

Part Two of my previous report alluded to a Part Three that I never got around to finishing, on the creation and analysis of a test data set consisting of real reads. This was a major concern of the reviewers who gave feedback on our initial pre-print. Without getting into too much detail (I'm sure there's time for that), I found a suitable data set consisting of real sequence reads from a lab mix of five HIV strains, used to benchmark algorithms in the related problem of viral quasispecies reconstruction. After fixing a small bug and implementing deletion handling, it turns out we do well on this difficult problem. Very well.

In the same fashion as our synthetic DHFR metahaplome, this HIV data set provided five known haplotypes, representing five different HIV-1 strains. Importantly, we were also provided with real Illumina short-reads from a sequencing run containing a mix of the five known strains. This was our holy grail, finally: a benchmark with sequence reads and a set of known haplotypes. Gretel is capable of recovering long, highly variable genes with 100% accuracy. My favourite result is a recovery of env (the ridiculously hypervariable gene that encodes the HIV-1 envelope glycoproteins), with Gretel correctly recovering all but one of 2,568 positions. Not bad.

A new pre-print

Armed with real reads, and improved results for our original DHFR test data (thanks to some fiddling with bowtie2), we released a new pre-print. The manuscript was a substantial improvement over its predecessor, which made it all the more disappointing to be rejected from five different journals. But, more on this misery at another time.

Despite our best efforts to address the previous concerns, new reviewers felt that our data sets were still not a good representation of the problem at hand: “Where is the metagenome?”. It felt like the goalposts had moved: suddenly real reads were not enough. It's a frustrating but fair response: work should be empirically validated, yet there are no metagenomic data sets with both a set of sequence reads and a set of known haplotypes. So, it was time to make one.

I’m a real scientist now…

And so, I embarked upon what would become the most exciting and frustrating adventure of my PhD. My first experiences of the lab as a computational biologist are the subject of a post still sat in draft, but suffice to say that the learning curve was steep. I’ve discovered that there are many different types of water and that they all look the same, that 1ml is a gigantic volume, that you’ll lose your fingerprints if you touch a metal drawer inside a -80°C freezer, and that contrary to what I might have thought before, transferring tiny volumes of colourless liquids between tiny tubes without fucking up a single thing takes a lot of time, effort and skill. I have a new appreciation for the intricate and stochastic nature of lab work, and I understand what it’s like for someone to “borrow” a reagent that you spent hours of your time making from scratch. And finally, I had a legitimate reason to wear an ill-fitting lab coat that I purchased in my first year (2010) to look cool at computer science socials.

With this new-found skill-tree to work on, I felt like I was becoming a proper interdisciplinary scientist, but this comes at a cost. Context switching isn’t cheap, and I was reminded of my undergraduate days where I juggled mathematics, statistics and computing to earn my joint honours degree. I had more lectures, more assignments and more exams than my peers, but this was and still is the cost of my decision to become an interdisciplinary scientist.

And it was often difficult to find much sympathy from either side of the Venn diagram…

..and science can be awful

I’ve suffered many frustrations as a programmer. One can waste hours tracking down a bug that turns out to be a simple typo or, more likely, an off-by-one error of the kind that plagues much of bioinformatics. I’ve felt the self-directed anger of having submitted thousands of cluster jobs that failed for want of a single parameter, or of waiting hours for a program to complete, only to discover the disk has run out of room to store the output. Yet these problems pale in comparison to problems at the bench.

I’ve spent days in the lab, setting-up and executing PCR, casting, loading and running gels, only to take a UV image of absolutely nothing at all.

Last year, I spent most of Christmas shepherding data through our cluster, much to my family’s dismay. This year, I had to miss a large family do for a sister’s milestone birthday. I spent many midnights in the lab, lamenting the life of a PhD student, and shuffling around with angry optimism: “Surely it has to fucking work this time?”. Until finally, I got what I wanted.

I screamed so loud with glee that security came to check on me. “I’m a fucking scientist now!”

New Nanopore Toys

My experiment was simple in principle. Computationally, I’d predicted haplotypes with my Gretel method from short-read Illumina data from a real rumen microbiome. I designed 10 pairs of primers to capture 10 genes of interest (with hydrolytic activity) using the haplotypes. And finally, after several weeks of constant, almost 24/7 lab work, building cDNA libraries and amplifying the genes of interest, I made enough product for the exciting next step: Nanopore sequencing.

With some invaluable assistance from our resident Nanopore expert Arwyn Edwards (@arwynedwards) and PhD student André (@GeoMicroSoares), I sequenced my amplicons on an Oxford Nanopore MinION, and the results were incredible.

Our Nanopore reads strongly supported our haplotypes, and concurred with the Sanger sequencing. Finally, we have empirical biological evidence that Gretel works.

The pre-print rises

With this bombshell in the bag, the third version of my pre-print rose from the ashes of the second. We demoted the DHFR and HIV-1 data sets to the Supplement, and included in their place an analysis of our performance on a de facto benchmark mock community introduced by Chris Quince. The data sets and evaluation mechanisms that our previous reviewers found unrepresentative and convoluted were gone. I even got to include a Circos plot.

Once more, we substantially updated the manuscript, and released a new pre-print. We made our way to bioRxiv to much Twitter fanfare, earning over 1,500 views in our first week.

This work also addresses every piece of feedback we’ve had from reviewers in the past. Surely, the publishing process would now finally recognise our work and send us out for review, right?

Sadly, the journey of this work is still not smooth sailing, with three of my weekends marred by Friday desk rejections…

…and a fourth desk rejection on the last working day before Christmas was pretty painful. But we are currently grateful to be in discussion with an editor and I am trying to remain hopeful we will get where we want to be in the end. Wish us luck!


In other news…

Of course, I am one for procrastination, and have been keeping busy while all this has been unfolding…

I hosted a national student conference

I am applying for some fellowships

I’ve officially started my thesis…

…which is just as well, because the money is gone

I’ve started making cheap lab tat with my best friend…

…it’s approved by polar bears

…and the UK Centre for Astrobiology

…and has been to the Arctic

I gave an invited talk at a big conference…

…it seemed to go down well

I hosted UKIEPC at Aber for the 4th year

We’ve applied to fund Monster Lab…

…and made a website to catalogue our monsters

For a change I chose my family over my PhD and had a fucking great Christmas


What’s next?

  • Get this fucking great paper off my desk and out of my life
  • Hopefully get invited to some fellowship interviews
  • Continue making cool stuff with Sam and Tom Industrys
  • Do more cool stuff with Monster Lab
  • Finish this fucking thesis so I can finally do something else

tl;dr

  • Happy New Year
  • For more information, please re-read
Bioinformatics is a disorganised disaster and I am too. So I made a shell.
https://samnicholls.net/2016/11/16/disorganised-disaster/ (Wed, 16 Nov 2016)

If you don’t want to hear me wax lyrical about how disorganised I am, you can skip ahead to where I tell you about how great the pseudo-shell that I made and named chitin is.

Back in 2014, about half way through my undergraduate dissertation (Application of Machine Learning Techniques to Next Generation Sequencing Quality Control), I made an unsettling discovery.

I am disorganised.

The discovery was made after my supervisor asked a few interesting questions regarding some of my earlier discarded analyses. When I returned to the data to try and answer those questions, I found I simply could not regenerate the results. Despite the fact that both the code and each “experiment” were tracked by a git repository and I’d written my programs to output (what I thought to be) reasonable logs, I still could not reproduce my science. It could have been anything: an ad-hoc, temporary tweak to a harness script, a bug fix in the code itself masking a result, or any number of other possible untracked changes to the inputs or program parameters. In general, it was clear that I had failed to collect all pertinent metadata for an experiment.

Whilst it perhaps sounds like I was guilty of negligent book-keeping, it really wasn’t for lack of trying. Yet when dealing with many interesting questions at once, it’s so easy to make ad-hoc changes, or perform undocumented command line based munging of input data, or accidentally run a new experiment that clobbers something. Occasionally, one just forgets to make a note of something, or assumes a change is temporary but for one reason or another, the change becomes permanent without explanation. These subtle pipeline alterations are easily made all the time, and can silently invalidate swathes of results generated before (and/or after) them.

Ultimately, for the purpose of reproducibility, almost everything (copies of inputs, outputs, logs, configurations) was dumped and tar‘d for each experiment. But this approach brought problems of its own: just tabulating results was difficult in its own right. In the end, I was pleased with that dissertation, but a small part of me still hurts when I think back to the problem of archiving and analysing those result sets.

It was a nightmare, and I promised it would never happen again.

Except it has.

A relapse of disorganisation

Two years later and I’ve continued to be capable of convincing a committee to allow me to progress towards adding the title of doctor to my bank account. As part of this quest, recently I was inspecting the results of a harness script responsible for generating trivial haplotypes, corresponding reads and attempting to recover them using Gretel. “Very interesting, but what will happen if I change the simulated read size”, I pondered; shortly before making an ad-hoc change to the harness script and inadvertently destroying the integrity of the results I had just finished inspecting by clobbering the input alignment file used as a parameter to Gretel.

Argh, not again.

Why is this hard?

Consider Gretel: she’s not just a simple standalone tool that one can execute to rescue haplotypes from the metagenome. One must first go through the motions of pushing raw reads through some form of pipeline (pictured below) to generate an alignment (essentially giving a co-ordinate system to those reads) and to discover the variants (the positions in that co-ordinate system that relate to polymorphisms on reads) that form the required inputs for the recovery algorithm.

This is problematic for one who wishes to be aware of the provenance of all outputs of Gretel, as those outputs depend not only on the immediate inputs (the alignment and called variants), but on the entirety of the pipeline that produced them. Thus we must capture as much information as possible regarding all of the steps that occur from the moment the raw reads hit the disk, up to Gretel finishing with extracted haplotypes.
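To make that concrete, here is the flavour of record that would need capturing for every step. This is a minimal sketch of the idea only (not chitin's actual schema), and record_step/checksum are names invented for illustration:

```python
# A minimal sketch: the metadata a single pipeline step would need to
# capture for full provenance. Illustrative only; not chitin's schema.
import hashlib
import getpass
import time

def checksum(path):
    """Fingerprint a file so we can tell later whether it has changed."""
    with open(path, "rb") as fh:
        return hashlib.sha256(fh.read()).hexdigest()

def record_step(cmd, inputs, outputs):
    """Describe one command: who ran what, when, on which file versions."""
    return {
        "cmd": cmd,
        "user": getpass.getuser(),
        "time": time.time(),
        "inputs": {p: checksum(p) for p in inputs},
        "outputs": {p: checksum(p) for p in outputs},
    }
```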

But as I described in my last status report, these tools are themselves non-trivial. bowtie2 has more switches than an average spaceship, and its output depends on its complex set of parameters and inputs (that also have dependencies on previous commands), too.

[Photo: the pipeline from raw reads to Gretel]

bash scripts are all well and good for keeping track of a series of commands that yield the result of an experiment, and one can create a nice new directory in which to place such a result at the end – along with any log files and a copy of the harness script itself for good measure. But what happens when future experiments use different pipeline components, with different parameters, or we alter the generation of log files to make way for other metadata? What’s a good directory naming strategy for archiving results anyway? What if parts (or even all) of the analysis are ad-hoc and we are left to reconstruct the history? How many times have you made a manual edit to a malformed file, or had to look up exactly what combination of sed, awk and grep munging you did that one time?

One would have expected me to have learned my lesson by now, but I think meticulous digital lab book-keeping is just not that easy.

What does organisation even mean anyway?

I think the problem is perhaps exacerbated by conflating the meaning of “organisation”. There are a few somewhat different, but ultimately overlapping problems here:

  • How to keep track of how files are created
    What command created file foo? What were the parameters? When was it executed, by whom?
  • Be aware of the role that each file plays in your pipeline
    What commands go on to use file foo? Is it still needed?
  • Assure the ongoing integrity of past and future results
    Does this alignment have reads? Is that FASTA index up to date?
    Are we about to clobber shared inputs (large BAMS, references) that results depend on?
  • Archiving results in a sensible fashion for future recall and comparison
    How can we make it easy to find and analyse results in future?

Indeed, my previous attempts at organisation address some but not all of these points, which is likely the source of my bad feeling. Keeping hold of bash scripts can help me determine how files are created, and the role those files go on to play in the pipeline; but results are merely dumped in a directory. Such directories are created with good intent, and named something that was likely useful and meaningful at the time. Unfortunately, I find that these directories become less and less useful as archive labels as time goes on… For example, what the fuck is ../5-virus-mix/2016-10-11__ref896__reg2084-5083__sd100/1?

This approach also had no way to assure the current and future integrity of my results. Last month I had an issue with Gretel outputting bizarrely formatted haplotype FASTAs. After chasing my tail trying to find a bug in my FASTA I/O handling, I discovered this was actually caused by an out of date FASTA index (.fai) on the master reference. At some point I’d exchanged one FASTA for another, assuming that the index would be regenerated automatically. It wasn’t. Thus the integrity of experiments using that combination of FASTA+index was damaged. Additionally, the integrity of the results generated using the old FASTA were now also damaged: I’d clobbered the old master input.

There is a clear need to keep better metadata for files, executed commands and results, beyond just tracking everything with git. We need a better way to document the changes a command makes in the file system, and a mechanism to better assure integrity. Finally we need a method to archive experimental results in a more friendly way than a time-sensitive graveyard of timestamps, acronyms and abbreviations.

So I’ve taken it upon myself to get distracted from my PhD and embark on a new adventure to save myself from ruining my PhD², and fix bioinformatics for everyone.

Approaches for automated command collection

Taking the number of post-its attached to my computer and my sporadically used notebooks as evidence enough to skip outright over the suggestion of a paper-based solution to these problems, I see two schools of thought for capturing commands and metadata computationally:

  • Intrusive, but data is structured with perfect recall
    A method whereby users must execute commands via some sort of wrapper. All commands must have some form of template that describes inputs, parameters and outputs. The wrapper then “fills in” the options and dispatches the command on the user’s behalf. All captured metadata has uniform structure and nicely avoids the need to attempt to parse user input. Command reconstruction is perfect but usage is arguably clunky.
  • Unobtrusive, best-effort data collection
    A daemon-like tool that attempts to collect executed commands from the user’s shell and monitor directories for file activity. Parsing command parameters and inputs is done in a naive best-effort scenario. The context of parsed commands and parameters is unknown; we don’t know what a particular command does, and cannot immediately discern between inputs, outputs, flags and arguments. But, despite the lack of structured data, the user does not notice our presence.

There is a trade-off between usability and data quality here. If we sit between a user and all of their commands, offering a uniform interface to execute any piece of software, we can obtain perfectly structured information and are explicitly aware of parameter selections and the paths of all inputs and desired outputs. We know exactly where to monitor for file system changes, and can offer user interfaces that not only merely enumerate command executions, but offer searching and filtering capabilities based on captured parameters: “Show me assemblies that used a k-mer size of 31”.

But we must ask ourselves, how much is that fine-grained data worth to us? Is giving up our ability to execute commands ourselves worth the perfectly structured data we can get via the wrapper? How many of those parameters are actually useful? Will I ever need to find all my bowtie2 alignments that used 16 threads? There are other concerns here too: templates that define a job specification must be maintained. Someone must be responsible for adding new (or removing old) parameters to these templates when tools are updated. What if somebody happens to misconfigure such a template? More advanced users may be frustrated at being unable to merely execute their job on the command line. Less advanced users could be upset that they can’t just copy and paste commands from the manual or Biostars. What about smaller jobs? Must one really define a command template to run trivial tools like awk, sed, tail, or samtools sort through the wrapper?

It turns out I know the answer to this already: the trade-off is not worth it.

Intrusive wrappers don’t work: a sidenote on sunblock

Without wanting to bloat this post unnecessarily, I want to briefly discuss a tool I’ve written previously, but first I must set the scene³.

Within weeks of starting my PhD, I made a computational enemy in the form of Sun Grid Engine: the scheduler software responsible for queuing, dispatching, executing and reporting on jobs submitted to the institute’s cluster. I rapidly became frustrated with having an unorganised collection of job scripts, with ad-hoc edits that meant I could no longer re-run a job previously executed with the same submission script (does this problem sound familiar?). In particular, I was upset with the state of the tools provided by SGE for reporting on the status of jobs.

To cheer myself up, I authored a tool called sunblock, with the goal of never having to look at any component of Sun Grid Engine directly ever again. I was successful in my endeavour and to this day continue to use the tool on the occasion where I need to use the cluster.

[Screenshot: sunblock]

However, as hypothesised above, sunblock does indeed require an explicit description of an interface for any job that one would wish to submit to the cluster, and it does prevent users from just pasting commands into their terminal. This all-encompassing wrapping feature, which allows us to capture the best, structured information on every job, is also the tool’s complete downfall. Despite the useful information that could be extracted using sunblock (there is even a shiny sunblock web interface), its ability to automatically re-run jobs, and its superior reporting on job progress compared to SGE alone, the tool still failed to get user traction in our institute.

For the same reason that I think more in-the-know bioinformaticians don’t want to use Galaxy, sunblock failed: because it gets in the way.

Introducing chitin: an awful shell for awful bioinformaticians

Taking what I learned from my experimentation with sunblock on board, I elected to take the less intrusive, best-effort route to collecting user commands and file system changes. Thus I introduce chitin: a Python-based tool that (somewhat) unobtrusively wraps your system shell, keeping track of commands and file manipulations to address the problem of not knowing how any of the files in your ridiculously complicated bioinformatics pipeline came to be.

I initially began the project with a view to creating a digital lab book manager. I envisaged offering a command line tool with several subcommands, one of which could take a command for execution. However, as soon as I tried out my prototype and found myself prepending the majority of my commands with lab execute, I wondered whether I could do better. What if I just wrapped the system shell and captured all entered commands? This might seem a rather dumb and roundabout way of getting one’s command history, but consider this: if we wrap the system shell as a means to capture all the input, we are also in a position to capture the output for clever things, too. Imagine a shell that could parse the stdout for useful metadata to tag files with…

I liked what I was imagining, and so, despite my best efforts to get even just one person to convince me otherwise, I wrote my own pseudo-shell.

chitin is already able to track executed commands that yield changes to the file system. For each file in the chitin tree, there is a full modification history. Better yet, you can ask what series of commands need to be executed in order to recreate a particular file in your workflow. It’s also possible to tag files with potentially useful metadata, and so chitin takes advantage of this by adding the runtime⁴ and current user to all executed commands for you.
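As a toy illustration of what “recreate a particular file” means, consider a history where each file version remembers the command that made it and the files that command read. The structure and names below are invented for illustration, not chitin’s real data model:

```python
# Toy dependency walk: emit, in order, the commands needed to rebuild
# a target file. Hypothetical structure, for illustration only.
history = {
    "out.sam": {"made_by": "bowtie2 -x ref -U reads.fq -S out.sam",
                "depends_on": ["reads.fq", "ref.fa"]},
    "reads.fq": {"made_by": "seqtk trimfq raw.fq > reads.fq",
                 "depends_on": ["raw.fq"]},
}

def recreate(target, seen=None):
    seen = seen if seen is not None else set()
    node = history.get(target)
    if node is None or target in seen:
        return []  # a raw input, or a file already scheduled for rebuild
    seen.add(target)
    cmds = []
    for dep in node["depends_on"]:
        cmds += recreate(dep, seen)
    return cmds + [node["made_by"]]

print(recreate("out.sam"))
# ['seqtk trimfq raw.fq > reads.fq', 'bowtie2 -x ref -U reads.fq -S out.sam']
```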

Additionally, I’ve tried to find my own middle ground between the sunblock-esque configurations that yielded superior metadata, and not getting in the way of our users too much. So one may optionally specify handlers that can be applied to detected commands and captured stdout/stderr. For example, thanks to my bowtie2 configuration, chitin tags my out.sam files with the overall alignment rate (and a few targeted parameters of interest), automatically.
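A handler in that spirit can be as simple as a regular expression over the captured output. The summary line matched below is bowtie2’s usual end-of-run report, but the handler itself is a hypothetical sketch rather than chitin’s bundled configuration:

```python
# Sketch: scrape bowtie2's end-of-run report for the overall alignment
# rate, returning it as metadata to tag the output SAM with.
import re

def bowtie2_handler(captured_output):
    m = re.search(r"([\d.]+)% overall alignment rate", captured_output)
    return {"overall_alignment_rate": float(m.group(1))} if m else {}

report = "10000 reads; of these:\n  ...\n85.96% overall alignment rate\n"
print(bowtie2_handler(report))  # {'overall_alignment_rate': 85.96}
```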

[Screenshot: chitin automatically tagging out.sam with bowtie2’s overall alignment rate]

chitin also allows you to specify handlers for particular file formats to be applied to files as they are encountered. My environment, for example, is set up to count the number of reads inside a BAM, and to associate that metadata with that version of the file:

[Screenshot: chitin counting the reads in a BAM and storing the count as metadata]
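A file-format handler of this sort can amount to very little code. Assuming pysam is available, something like this sketch would do (a real handler would attach the count as metadata rather than just return it):

```python
# Sketch: count the reads in a BAM so the total can be stored against
# this version of the file. Assumes pysam is installed.
import pysam

def bam_read_count(path):
    with pysam.AlignmentFile(path, "rb", check_sq=False) as bam:
        return sum(1 for _ in bam)
```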

In this vein, we are in a nice position to check on the status of files before and after a command is executed. To address some of my integrity woes, chitin allows you to define integrity handlers for particular file formats too. Thus my environment warns me if a BAM has 0 reads, is missing an index, or has an index older than itself. Similarly, an empty VCF raises a warning, as does an out-of-date FASTA index. Coming shortly are additional checks for whether you are about to clobber a file that other files in your workflow depend on. Kinda cool, even if I do say so myself.
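The out-of-date index check, for instance, needs little more than a comparison of modification times. A sketch, assuming mtimes are an acceptable proxy for staleness:

```python
# Sketch of an integrity handler: warn if a FASTA's .fai index is
# missing, or older than the FASTA it indexes.
import os

def check_fasta_index(fasta):
    fai = fasta + ".fai"
    if not os.path.exists(fai):
        return "WARN: %s has no index" % fasta
    if os.path.getmtime(fai) < os.path.getmtime(fasta):
        return "WARN: %s is older than %s" % (fai, fasta)
    return "OK"
```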

Conclusion

Perhaps I’m trying to solve a problem of my own creation. Yet from a few conversations I’ve had with folks in my lab, and frankly, anyone I could get to listen to me for five minutes about managing bioinformatics pipelines, there seems to be sympathy for my cause. I’m not entirely convinced myself that a “shell” is the correct solution here, but it does seem to place us in the best position to capture commands entered by the user, with the added bonus of getting stdout to parse for free. Judging by the flurry of Twitter activity on my dramatically posted chitin screenshots lately, I suspect I am not so alone in my disorganisation, and that there are at least a handful of bioinformaticians out there who think a shell isn’t the most terrible solution to this either. Perhaps I just need to be more of a wet-lab biologist.

Either way, I genuinely think there’s a lot of room to do cool stuff here, and to my surprise, I’m genuinely finding chitin quite useful already. If you’d like to try it out, the source for chitin is open and free on GitHub. Please don’t expect too much in the way of stability, though.


tl;dr

  • A definition of “being organised” for science and experimentation is hard to pin down
  • But independent of such a definition, I am terminally disorganised
  • Seriously what the fuck is ../5-virus-mix/2016-10-11__ref896__reg2084-5083__sd100/1¹
  • I think command wrappers and platforms like Galaxy get in the way of things too much
  • I wrote a “shell” to try and compensate for this
  • Now I have a shell, it is called chitin

  1. This is a genuine directory in my file system, created about a month ago. It contains results for a run of Gretel against the pol gene on the HIV genome (2084-5083). Off the top of my head, I cannot recall what sd100 is, or why reg appears before the base positions. I honestly tried. 
  2. Because more things that are not my actual PhD is just what my PhD needs. 
  3. If it helps you, imagine some soft jazz playing to the sound of rain while I talk about this gruffly in the dark with a cigarette poking out of my mouth. Oh, and everything is in black and white. It’s bioinformatique noir
  4. I’m quite pleased with this one, because I pretty much always forget to time how long my assemblies and alignments take. 
Interdisciplinary talks and the metaphor-ome: Harder than metagenomics itself?
https://samnicholls.net/2016/11/03/talks-and-metaphors/ (Thu, 03 Nov 2016)

Yesterday I spoke at the Centre for Computational Biology at the University of Birmingham. I was invited to give a talk as part of their research seminar series about the work I have been doing on metagenomes. The lead-up to this was pretty nerve-wracking, as this was my first talk outside of Aberystwyth (since my short introductory talk at KU Leuven last year), and the majority of my previous talks have been to my peers, which I find to be a lot less intimidating than a room full of experts from various fields.

Metaphorical Metagenomes

I submitted the current working title of my PhD: “Extracting exciting exploitable enzymes from massive metagenomes“, which I think is a rather catchy summary of what I’m working on here. I borrowed the opening slides from my previous talks (this is a cow…) but felt like I needed to try a new explanation of why the metagenome is so difficult to work with. Previously, I’ve described the problem using jigsaw puzzles: i.e. consider many distinct (but visually similar) jigsaws, mixed together (with duplicate and missing pieces). Whilst this is a nice, accessible description that appears to serve well, it tends to leave some listeners confused about my objective, particularly:

  • You are recovering whole genomes?
    The jigsaw metaphor doesn’t lend itself well to the metahaplome or the concept of assembling a jigsaw partially. Listeners assume we want to piece together each of the different jigsaws in our box, whole – presumably because people find those who don’t finish jigsaws terrible.
  • We can assemble jigsaws incorrectly?
    Metagenomic assemblies are a consensus sequence of the observed reads. The resulting sequence is unlikely to exist in nature. Whilst we can extend our metaphor to explain that pieces of jigsaws may have the same shape, such that we can put together puzzles that don’t exist, this is not immediately obvious to listeners.

A common analogy for genomic assembly is that of pages shredded from a book. I’ve also previously pitched this at a talk to try and explain metagenomic assembly, but this has some disastrous metaphorical pitfalls too:

  • You are recovering whole books?
    Akin to the jigsaw analogy, listeners don’t immediately see why we would only want to assemble parts of a book. What part? A chapter? A page? A paragraph? Which leads to…
  • Why are there paragraphs shared between books?
    To describe our problem of recovering genes that appear across multiple species, we must say that we are attempting to recover some shared sequence of words from across many books. This somewhat breaks the metaphor as this isn’t a problem that exists, and so the concept just causes listener confusion, rather than helping them to understand our problem. Whilst we could point out the Bible as an example of a book that has been translated and shared to a point where two copies of the text do feature differences between their passages, we figure it best to avoid conversations about the Bible and shredded books.
  • You are assembling words into sentences? The problem is easy?
    DNA has a limited alphabet: A, C, G and T. But books can contain a practically infinite combination of character sequences given their larger alphabets. This larger alphabet makes distinguishing sequence similarity incredibly simple compared to that of DNA. Right now I’m using an alphabet of about 95 characters (upper and lowercase characters, numbers and a subset of symbols) in this post, and although it’s possible that one or more of my sentences could appear elsewhere on the web (unintentionally), the probability of this will be many, many orders of magnitude smaller than that of finding duplication of DNA sequences within and between genomes. Thus by comparing the problem to reconstructing pages from a book, we are at a very real risk of underselling the difficulty of the problem at hand.
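To put rough numbers on that last point (illustrative arithmetic only):

```python
# How many distinct length-20 sequences exist over each alphabet?
dna_20mers = 4 ** 20       # {A, C, G, T}
text_20chars = 95 ** 20    # ~95 printable characters
print(f"{dna_20mers:.2e}")    # ~1.10e+12
print(f"{text_20chars:.2e}")  # ~3.59e+39
```

A random 20-mer of DNA is drawn from a pool some 27 orders of magnitude smaller than a random 20-character string of text, so repeats within and between genomes are vastly more likely.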

Additionally, both analogies fail to explain:

  • Intra-object variation
    We must also shoehorn the concept of intraspecies gene variation into these metaphors which turns out rather clunky. We do say that books and jigsaws have misprints and errors, but this doesn’t truly emphasise that there is real variation between instances of the same object.
  • What is the biological significance anyway?
    Neither description of the problem comes close to explaining why we’d even want to retrieve the different but similar-looking areas of a jigsaw, or copies of a page or passage shared across multiple books.

Machines and Factories: A new metaphor?

So, I spent some time iterating over many ideas and came up with a new concept of “genes as machines” and “genomes as factories”:

Genes

Consider a gene as a physical machine. Its configuration is set by altering the state of its switches. The configuration of a machine is akin to a sequence of DNA. It is possible (and even intended) that the machine can be configured in many different ways by changing the state of its switches (like gene variants), but it is still the same machine (the same gene). This is an important concept because we want to describe that a machine can have many configurations (that can do nothing, improve performance, or even break it), whilst still remaining the same machine (i.e. a variant of a gene).

[Figure: a gene as a machine whose switches set its configuration]

Factories

We can consider a genome as a factory, holding a collection of machines and their configurations:

[Figure: a genome as a factory housing a collection of configured machines]

We can extend this metaphor to groups of factories under a parent organisation (i.e. a species) who can set the configuration of their machines autonomously – introducing intra-species variation as a concept. Additionally we can describe groups of factories under other parent organisations (species) that also deploy the same machine to their own factories, also configuring them differently – introducing not only intra-species variation, but multiple sets of intra-species variants too:

[Figure: several parent organisations (species), each with factories (genomes) configuring the same machine differently]

Talk the Talk

Armed with my shiny diagrams and some apprehension about my own new metaphor, I pitched it to my listeners as a test, thanking them for their role as guinea pigs for this new attempt at explaining the core concept of the metagenome and its problems.

In general, I felt like the audience followed along with the metaphor to begin with. Given a fictional company, Enzytech, and their machine, Awesomease, we could define the metahaplome as the collection of configurations of that Enzytech Awesomease product across multiple factories, under various parent companies (i.e. different genomes, of varying species). However, I think the story unravelled when I described the process of actually recovering the metahaplome.

I set the scene: Sam from Enzytech wondered why factories configured their Awesomease differently. Sam figured there must be an interesting meaning to these configurations – do some combinations of switches cause the Awesomease to -ase more awesomely? Thus, Sam approaches each parent company and requests their Enzytech Awesomease configurations. In a surprising gesture of co-operation, the businesses comply and return all their Enzytech Awesomease configurations, for all of their factories. Unfortunately, and perhaps in breach of their own trade secrets, they also submit the configurations of every other machine (gene) in each of their factories (genomes) too:

[Figure: every company returns the configurations of every machine in every factory, not just Awesomease]

To make matters worse, the configurations don’t describe the specific factory they are from (i.e. the individual organism), and their returned documents also include incomplete, broken and duplicated configurations. Lost configurations are not submitted.

I think at this point, I was getting too wrapped up in the metaphor and its story. The concept of metaphorical factories submitting bad paperwork to fictional Sam from Enzytech did not have an obvious biological reference point (it was supposed to describe metagenomic sampling). I think with practice, I could deliver this part better such that my audience understands the relevance to biology, but I am not sure it is necessary. Where things definitely did not work was this slide:

[Figure: the misfiled, shredded configurations]

“Unfortunately, an Enzytech intern misfiled the paperwork submitted by all of the parent companies’ factories (species and their various genomes), and we could no longer determine which company submitted which configuration. The same clumsy intern then accidentally shredded all of the configurations, too.”

Welp. I am somewhat cringing at the amount of biological shoehorning going on in just one slide. Here I wanted to explain that although my pretty pictures have helpful colour coding for each of the companies (species), we don’t have this in our data. A configuration (gene variant) could come from any factory (genome) in our sample, and there is no way of differentiating them. Although shredding is a (common) reference to short-read sequencing technology, the delivery of this slide feels as clumsy as the Enzytech intern. I think the mistake I have made here was trying to use the same metaphorical explanation for two separate and distinct problems that I face in my work on metagenomes:

  • The metahaplome
    We need to clearly define what the metahaplome actually is as it is a term we coined. It is also the objective of my algorithm, and so failing to adequately describe this means it is unclear why this work has a biological relevance (or is worth a PhD).
  • Metagenomes, assembly, and short read sequencing
    This final slide attempts to describe metagenomes and sequencing, as shredded paperwork relating to many different genes, from multiple factories that are held by various parent companies, all mixed together. But in fact, for this part of the metaphor it is easier to just say “bits of DNA, from a gene, on multiple organisms, from multiple species in the same environmental sample”…

On this occasion, I believe I managed to explain the metahaplome more clearly to an audience than ever before, though this might be in part because this is my first talk since our pre-print. However, in forcing my new metaphor onto the latter problem (of sequencing), I think I inadvertently convoluted what the metagenome is. So ultimately, I’m not entirely convinced the new metaphor panned out with a mixed audience of expert computer scientists and biologists. That said, I had several excellent questions following the talk, that seemed to imply a deep understanding of the work I presented, so hooray! Regardless of whether I deploy it for my next talk, I think it will still prove a nice way to explain my work to the public at large (who may have no frame of reference to get confused with).

I enjoyed the opportunity to visit somewhere new and speak about my work, especially as my first invited talk outside of Aberystwyth. This is also a reminder that even sharing thoughts and progress on cross-discipline work is hard. It’s a lot of work to come up with a way to get everyone in the audience on the same page; capable of speaking the same language (biological, mathematical and computational terminology) and also give the necessary background knowledge (genomic sequencing and metagenomes) to even begin to pitch the novelty and importance of our own work.


Obligatory proof that people attended:

Obligatory omg my heart rate:

[Screenshot: my heart rate during the talk]


tl;dr

  • I was invited to speak at Birmingham, it was nice
  • It’s super hard to come up with explanations of your work that will please everyone
  • Spending until 4am drawing some rather shiny diagrams is perhaps not the best reason to push forth with a new metaphor that even you feel a little uneasy about
  • I continue to speak too bloody quickly
  • My body still gives the physiological impression I am doing exercise whilst speaking publicly
Teaching children how to be a sequence aligner with Lego at Science Week
https://samnicholls.net/2016/03/29/abersciweek16/ (Tue, 29 Mar 2016)

As part of a PhD it is anticipated¹ that you will share your science with various audiences: fellow PhD students, peers in the field and the various publics. Every year, the university celebrates British Science Week with a Science Fair, inviting possibly the most difficult public to engage with: children. Over three days the fair serves to educate and entertain 1,700 pupils from over 30 schools based across Mid Wales, and this year I volunteered² to run a stand.

How to explain assembly?

I was inspired by Amanda’s activity for prospective students at a visiting day a few weeks prior. To describe the problem of DNA sequence assembly and alignment in a friendly (and quick) way, Amanda had hundreds of small pieces of paper representing DNA reads. The read set was generated with Titus Brown’s shotgunator tool, slicing a few sentences about the problem (meta!) into k-mers, with a few errors and omissions for good measure. Visitors were asked to help us assemble the original sequence (the sentences) by exploiting the overlaps between reads.

I like this activity as it gives a reasonable intuition for how assembly of genomes works, using just scraps of paper. Key is that the DNA is abstracted into something more tangible to newcomers – English words building sentences – which is far simpler to explain and understand, especially in a short time. It’s also quite easy to describe some of the more complicated issues of assembly, namely errors and repeats via misspellings and repeated words or phrases.

A problem with pigeonholing college students?

Yet to my surprise, the majority of the compscis-to-be were quite apprehensive about taking on the task at the mere mention of this being a biological problem, despite the fact that sequence alignment can easily be framed as a text manipulation problem. Their apprehension only increased when introduced to Amanda’s genome game; a fun web-based game that generates a small population with a short binary genome whose rules must be guessed before the time runs out. A few puzzled visitors offered various flavours of “…but I’m not here to do biology!”, and one participant backed out of playing with “…but biology is scary and too hard!”. In general the activities had a reasonable reception, but visitors appeared more interested in the Arduinos, web games and robots – their comfort zone, presumably.

One need not necessarily be an expert in biology (I’m certainly not) to be able to contribute to the study of computationally framed questions in that field. As mentioned, DNA alignment is effectively string manipulation and those strings could be anything! Indeed this is even demonstrated by our activity using English sentences rather than the alphabet ACGT.

From experience, undergraduates (and apparently college students) appear keen to pigeonhole themselves early (“…dammit Jim I’m a computer scientist not a bioinformatician”) via their prior beliefs about the meaning of “computing”, and their module/A-level choices. I think it is at this stage that subjects outside one’s choices become “scary” and fall outside one’s scope of interest — “…if I wanted to learn biology why would I be doing compsci?”. Yet most jobs from finance to game development will require some domain-specific knowledge and reading outside computing, whether it’s economics, physics or even art and soundscape design.

This is why it is important as a computer science department that we introduce undergraduates to other potential applications of the field. It’s not that we should push students to study bioinformatics over robotics, but that many students can easily go on unaware that computing can be widely applicable to research endeavours in different fields in the first place. Though to combat the “this is not my area” issue, in our department, many assignments have a real-world element, often just tidbits of domain specific knowledge that force students to recognise the need for base understanding of something outside of their comfort zone.

Lego: a unicorn-like universal engagement tool

College students aside, I needed to work out how to engage schoolchildren between the ages of 10 and 12 with this activity. Scraps of paper would be unlikely to hold the attention of my target age group for long; I needed something more tangible and less fiddly than strips of paper. It was while describing the problem of introducing these “building blocks of nature” to kids in a simple way that the perfect metaphor popped into my mind: Lego.

Yes! A 2×2 brick can represent an individual nucleotide, and we can use different coloured bricks to colour code the four nucleotides (and maybe another for “missing” if we’re feeling mean). A small stack of bricks builds a short string of DNA to represent a read. The colour code effectively abstracts away the potentially-confusing ACGT alphabet, making the alignment game easier to play (matching just colours, rather than symbols that need parsing first) and also quite aesthetically pleasing.

The hard part was sourcing enough Lego. I returned to my parents’ home to dig through my childhood and retrieve years’ worth of collected pieces, but once back in Aberystwyth I was surprised to find that after sorting through two whole boxes I did not own more than some 100 2×2 bricks (and most were not in colours I wanted). Bricks, it appears, are actually quite hard to come by! I put out a request for help on the Aber Comp Sci Facebook group and a lecturer kindly performed the same sort with his children’s collections. Their collection must have been more substantial, yielding 150-200 bricks in a mix of four colours and saving my stand.

The setup

The activity itself is simple and needs nothing other than some patter, the Lego and a surface for kids to align the pieces on. I spent more time than I would like to admit covering a cardboard box with tinfoil to create the SAMTECH SEQUENCER 9000 (described by Illumina as “shiny”), a prop to contextualise the problem: we can’t look at a whole genome, only short pieces of it that need assembly.

[Photo: the SAMTECH SEQUENCER 9000, a cardboard box covered in tinfoil]

Of course, we’d need some read sets. To make these, I divided the available bricks into two piles; Nathan and I then each ad-libbed sliding k-mers of length 5 (i.e. each stack overlapped its neighbours by 4, 3, 2 and 1 coloured bricks, and those neighbours had overlaps of their own…) to build up an arbitrary genome to recover. Simple!
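In code, the read sets we improvised by hand amount to sliding windows over a colour sequence (a toy sketch; the colours and genome here are arbitrary):

```python
# Build "reads": every window of k consecutive bricks from the genome.
genome = ["red", "blue", "yellow", "green", "red", "blue", "green",
          "yellow", "red", "green"]
k = 5
reads = [genome[i:i + k] for i in range(len(genome) - k + 1)]
for read in reads:
    print("-".join(read))  # e.g. red-blue-yellow-green-red
```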

Running the activity

Once doors opened, there was no shortage of children wanting to try out the stand. I think the mystery of the tinfoil box and the allure of playing with Lego was enough to grab attention, though Nathan (my lovely assistant) and I would flag down passers-by if the table was free. Pupils were encouraged to visit as many activities as possible by means of a questionnaire, on which each stand posed a scientific question that could be answered by completing that particular stand’s activity. Unfortunately for us, our stand’s question was not included on the questionnaire (I guess we submitted it too late) but luckily, we found pupils were keen to write down and find an answer to our “bonus question” after all.

We quickly developed a double-act routine; opening by quizzing our aligners on what they knew about DNA, which was typically not much, though it was nice to hear that the majority were aware that “it’s inside us”. Interestingly, of the pupils who responded in the positive to being asked what DNA was, their exposure was primarily from television – specifically when used for identification of criminals. Nathan would then explain that if we wanted to look at somebody’s DNA, we would take a sample from them and process it with the shiny tinfoil sequencer. This special machine would apply some magic science and produce short DNA reads that had to be pieced back together to recover the whole genome.

At this point we’d invite participants to open the lid of the sequencer and take out a batch of reads (of a possible two sets) for assembly. We’d explain the rules and show some examples of a correct alignment: sequences of matching runs of colour between two or more Lego stacks. Once they got the hang of it, we’d leave them to it for a little while. The two sets meant that we could split larger groups into pairs or triplets to ensure that everybody had a chance to make some successful alignments.

As the teams came to finish aligning the most obvious motifs (Nathan and I had both accidentally made a few triplets of colours in our read sets that resembled well-known flags – which was handy), progress would begin to slow and a few more difficult or red-herring reads would be left over. Nathan or I would then start narrating the problem, asking teams if this had been more difficult than expected. I don’t think any team agreed that the activity had been easy! We used this as an opportunity to interrupt the game, frame how complicated assembly is for real sequences, and reveal the answer to our question.

The debrief

This was my favourite part. I’d hold up one of the Lego stacks and pull it apart. “Each of these bricks is a single base; stacked together they make this read, which tells us what a small part of a much longer genome looks like”. I’d then ask how long they imagined a whole human genome might be. Answers most frequently ranged between 100 and 1,000; a minority guessed between 4 and 15. No pupil ventured guesses beyond a million. For the very small guesses, I’d assemble a Lego stack of that length and ask if they still thought the differences between us all could be explained by such a short genome – nobody changed their mind³.

The look on their faces when I revealed it was actually three billion made the entire activity worth it. If we had enough Lego to build a genome, it would be 28,800km tall and stretch into space far beyond where global positioning satellites are in orbit. I’d explain that when we do this for real, the stacks aren’t five bases long, but more like a hundred, and instead of the handful of reads we had in our tinfoil sequencer, there were millions of reads to align and assemble. They’d gasp and look around at each other’s faces, equally stunned. We even had some teachers dumbfounded by this reveal. “This is why computers are now so important in biology, this would be impossible otherwise!”. We’d clear up any last questions or confusions and thank them for playing.
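For the curious, the arithmetic behind the reveal, assuming the standard 9.6mm height of a Lego brick:

```python
# One brick per base: how tall is a three-billion-base Lego genome?
bases = 3_000_000_000
brick_height_mm = 9.6
print(bases * brick_height_mm / 1e6, "km")  # 28800.0 km
```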

Some observations

I would not consider our first group a resounding success. I was not ready for how difficult assembly of a set of unique 5-mers would be. The group had significant trouble recovering the genome and, as it turned out, Nathan and I did too. The situation had not been helped by the fact that the group had also taken a mix of reads from both batches in the tinfoil sequencer. As it turns out, even trivial assembly is really hard. I could tell the kids were somewhat disappointed, and the difficulty of the game had hampered their enjoyment. We recovered by wowing them with facts about the human genome, and they asked some good questions too. Once they left the table, Nathan began the patter with the next group as I hurriedly worked to reduce the number of red-herring reads and recycle the bricks to create duplicate reads, which allowed groups to make progress more quickly at the beginning (effectively turning the difficulty into a ramp, rather than leaving it uniformly hard). This improved further games considerably.

I was surprised how happy the pupils were to append our fairly long question to an already quite lengthy questionnaire, and how keen they were to find the answer, too. Not a single pupil was put off from our activity at the mention of biology, DNA or even unfamiliar terminology like “sequencer” or “read”. Fascinatingly, Amanda also ran the aforementioned genome game and it was a hit. I guess primary school students are just open to a very wide definition of science and are yet to pigeonhole themselves? Activities like this at an early age have the potential to massively influence how our next generation of scientists see science: as a large collaborative effort, where skills can be transferred and shared to solve important and interesting questions. The pupils simply had no idea that computers could be used like this, for science, let alone for biologically inspired questions.

In general the activity went down very well: the kids seemed to get the concept very quickly and also understood the (albeit naive) parallel to DNA. I think they genuinely learned a thing or two (the human genome is big!) and enjoyed themselves. I’m pleased that we managed to draw and keep attention to our stand, given we were wedged between a bunch of old Atari consoles and a display of unmanned aerial vehicles.

I was definitely surprised at how much I enjoyed running the stand too. I’m not overly fond of children and was expecting to have to put on a brave face to deal with tiny, uninterested people in assorted bright sweaters all day. Yet all but one or two pupils were happy to be there, incredibly enthusiastic to learn, asked great (sometimes incredibly insightful) questions and genuinely had a nice time and thanked us for it. Enjoyment aside, I took the second day off as I’d also found running the activity over and over oddly draining.

Future activities

If I were to run this again, I’d like to make it a little more interactive and ideally give players a chance to actually use Lego for its intended purpose: building something. Thankfully at our stand, students were not particularly disappointed when our rules stated that they couldn’t take the reads apart, or put them together (i.e. couldn’t actually play with the Lego…). To improve, my idea would be to get participants to construct a short genome out of Lego pieces that can be truly “sequenced” by pushing it through some sort of colour sensor or camera apparatus attached to an Arduino inside a future iteration of the trusty SAMTECH Sequencer range. Some trivial software would then give the player some sort of monster to name⁴, print off and call their own.

To run the activity again in its current form, I think I’d need to have more Lego. However, it turns out that packs of 2×2 bricks in one colour are widely available on eBay and Amazon, though aren’t actually that much cheaper than ordering via the “Pick a Brick” service on the canonical Lego website. I’ve ordered a few packs (at an astonishing £0.12 per brick) as I would like to try and run this activity at other events to spread the sheer joy that bioinformatics can bring to one’s afternoon.

To give the current version of the game a little more of a goal, it would have been ideal to explain the concept of a genomic reference and have the players align the reads to that (as well as to each other); in effect this would have been like solving the edges of a jigsaw, giving a sense of quick progress (which means fun) and also affording us the opportunity to explain more of the “real science” behind the game. To make the game more difficult, we could have properly employed “missing bases” and the common issues that plague assembly, including repeats (which are easier to explain with a reference), as well as errors. After the first group at the Science Fair, I quickly removed the majority of sneaky errors as they made the game too “mean” (Nathan or I had to explain “No, that one doesn’t go there!” too frequently).

Some proof that I did public engagement⁵

tl;dr

  • Actual Lego bricks are hard to come by (unless you just buy them)
  • Typical ten year olds are not as dumb or as apathetic to science as one might expect
  • Assembly is actually pretty hard
  • Engaging with children with science is exhausting but surprisingly rewarding
  • Acquire more Lego
  • It’s very hard to tinfoil a cardboard box nicely

  1. Read, required. 
  2. Read, was coerced. 
  3. With a single Lego brick in hand, one kid looked me dead in the eye and said “Yeah!” when asked if this single base could explain the differences between every human on Earth. 
  4. Genome McGenface? 
  5. Absolutely not using this to pass my public engagement module. 
Goldilocks: A tool for identifying genomic regions that are “just right”
https://samnicholls.net/2016/03/08/goldilocks/ (Tue, 08 Mar 2016)

I’m published! I’m a real scientist now! Goldilocks, my Python package for locating regions on a genome that are “just right” (for some user-provided definition of just right), is published software: you can check out the application note on Bioinformatics Advance Access, download the tool with pip install goldilocks, view the source on GitHub and read the documentation on readthedocs.

Status Report: February 2016 https://samnicholls.net/2016/03/01/status-feb16/ https://samnicholls.net/2016/03/01/status-feb16/#comments Tue, 01 Mar 2016 23:41:33 +0000 https://samnicholls.net/?p=558 I have a meeting with Amanda tomorrow morning about my Next Paper™, so I thought it might be apt to gather some thoughts and report on the various states of disarray that the different facets of my PhD are currently in. Although I’ve briefly outlined the goal of my PhD in a previous status report, as yet I’ve avoided exploring much of the detail here; partly because the work is unpublished, but primarily due to my laziness when it comes to blogging1.

The Metahaplome

At the end of January I gave a talk at the Aberystwyth Bioinformatics Workshop2. The talk briskly sums up the work done over the first year-and-a-bit of my PhD and introduces the metahaplome: our very own new -ome, a graph-inspired representation of the variation in single nucleotide polymorphisms observed across aligned reads from a sequenced metagenome. The idea is to isolate and store information on only the genomic positions that actually vary across sequenced reads and, more importantly, to keep track of the observed evidence for these variations co-occurring. This evidence can be exploited to reconstruct sequences of variants that are likely to actually exist in nature, as opposed to the crude approximations provided by the assembly-algorithm-du-jour.
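
To make that idea a little more concrete, here is a minimal sketch of my own (a toy illustration, not the actual implementation; the (start, sequence) read representation and 0-based SNP positions are assumptions made for brevity) that tallies the evidence for neighbouring SNP alleles being observed together on the same read:

```python
from collections import defaultdict

def build_evidence(reads, snp_positions):
    """Tally co-occurring alleles at neighbouring SNP sites across reads.

    reads: (start, sequence) tuples, already aligned to a common reference.
    snp_positions: sorted 0-based reference positions known to vary.
    Returns: ((pos_a, allele_a), (pos_b, allele_b)) -> supporting read count.
    """
    evidence = defaultdict(int)
    for start, seq in reads:
        # SNP sites this read actually spans
        covered = [p for p in snp_positions if start <= p < start + len(seq)]
        for a, b in zip(covered, covered[1:]):  # one SNP followed by the next
            edge = ((a, seq[a - start]), (b, seq[b - start]))
            evidence[edge] += 1  # one more read observes these alleles together
    return evidence

reads = [(0, "ACGTACGT"), (2, "GAACGTTT"), (0, "ACATACGT")]
print(dict(build_evidence(reads, snp_positions=[2, 4])))
# {((2, 'G'), (4, 'A')): 2, ((2, 'A'), (4, 'A')): 1}
```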

I spent the summer of last year basking in the expertise of the data mining group at KU Leuven: learning to drive on the wrong side of the road, enjoying freshly baked bread and chocolate, incorrectly arguing that ketchup should be applied to fries instead of mayonnaise, and otherwise pretending to be Belgian. I took with me two different potential representations for the metahaplome and hoped to come back with an efficient Dutch solution that would solve the problem quickly and accurately. Instead, amongst the several kilograms of chocolate, I returned with the crushing realisation that the problem I was dealing with was certainly NP-hard (i.e. very hard) and that my suggestion of abusing probability was likely the best candidate for generating solutions.

The trip wasn’t a loss, however: my best friend and I explored some Belgian cities, the coast of the Netherlands, and accidentally crossed the invisible Belgian border into French-speaking Wallonia, much to our confusion. I discovered mayonnaise wasn’t all that bad, attended a public thesis defence and had the honour of an invite to celebrate the award of a PhD by getting drunk in a castle. I discarded several implementations of the data structures used to house the metahaplome and began work on a program that could parse sequenced reads into the latest structure. I came up with a method for calculating the weights of edges in the graph, and another for approximating those calculations after the exact versions also proved as unwieldy as the graph itself.

The metahaplome is approximations all the way down.

But the important question: will this get me a PhD? That is, does it work? Can my implementation be fed some sequenced reads that (probably) span a gene that is shared, but variable, between some number of different species sampled together in a metagenome? The short answer is yes and no. The long answer is that I’m not entirely sure yet.

Trouble In Silico

I’m at an empirical impasse: the algorithm performs very well or very poorly, and occasionally traps itself in a bizarre periodic state, depending on the nature of the input data. Currently the as-yet-unnamed3 metahaplome algorithm is being evaluated against several data sets, which can be binned into one of three categories (a toy generator in the spirit of the first two follows the list):

  • Triviomes: Simulated-Simulated Data
    A script generates a short gene with N single nucleotide polymorphisms. The gene has M different known variants (sets of N SNPs), with each variant mi expressed in a simulated sample at some proportion. The script outputs a SAM and a VCF for the reads and SNP positions respectively. The metahaplome is constructed and traversed, and the algorithm is evaluated by its ability to recover the M known variants.

  • Simulated-Real Data
    A gene is pulled from a database and submitted to BLAST. Sequences of similar, but not exact, identity are identified; these extracted hits are aligned to the original gene and variants are called loosely with samtools. Each gene is then fragmented into k-mers that act as artificial reads for the construction of the metahaplome. In a similar fashion to before, the metahaplome is traversed and the algorithm is evaluated by its ability to recover the genes extracted from the BLAST search. Although it uses real sequences, this method is still rather naive in itself, and further analysis would be needed to evaluate the algorithm’s stability when encountering:

    • Indels
    • Noise and error
    • Poor coverage
    • Very skewed proportions of mi
  • Real Data for Real
    Variant calling is completed on real reads that align to a region of a metagenomic assembly that looks “interesting”. A metahaplome is constructed and traversed. The resulting paths typically match hypothetical or uncharacterised proteins with some identity. This is exciting and impossible to evaluate empirically, which is nice, because nobody can prove the results are incorrect yet.
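
As promised above, here is a toy generator in the spirit of the first two categories (an illustrative sketch with made-up parameters, not the actual triviome script, which also writes out the SAM and VCF):

```python
import random

def make_triviome(gene_len=60, n_snps=5, m_haplotypes=3, k=20, seed=42):
    """Toy triviome: a random gene, M haplotypes differing at N SNP sites,
    and perfect error-free k-mer 'reads' shredded from each haplotype."""
    rng = random.Random(seed)
    gene = [rng.choice("ACGT") for _ in range(gene_len)]
    snp_sites = sorted(rng.sample(range(gene_len), n_snps))

    haplotypes = []
    for _ in range(m_haplotypes):
        hap = list(gene)
        for pos in snp_sites:
            # swap in any base other than the reference one
            hap[pos] = rng.choice([b for b in "ACGT" if b != gene[pos]])
        haplotypes.append("".join(hap))

    # every haplotype contributes equally here; skewing these proportions is
    # exactly the sort of knob the evaluation above would need to turn
    reads = [(i, hap[i:i + k])
             for hap in haplotypes
             for i in range(gene_len - k + 1)]
    return haplotypes, snp_sites, reads

haps, sites, reads = make_triviome()
print(len(haps), "haplotypes differing at sites", sites, "shredded into", len(reads), "reads")
```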

In general the algorithm performs well on triviomes, which is good news considering their simplicity. However, results on the simulated-real data are mixed, and I don’t yet have enough evidence to say why. The real issue stems from the difficulty of generating test data in a form acceptable to my own software. Reads must be aligned and SNPs called beforehand, but the software for assembly, variant calling and short-read alignment is external to my own work and can produce results that I might not consider optimal for evaluation. In particular, when generating triviomes, I had difficulty getting a short-read aligner to produce the alignments I would expect to see; for this reason, the triviome script currently generates its own SAM.

Problems at both ends

My trouble isn’t limited to the construction of the metahaplome, either. Whilst the majority of initial paths recovered by my algorithm are on target for a gene that we know exists in the sample, we want to go on to recover the second, third, …, i-th best paths from the graph. To do this, the edges in the graph must be re-weighted. My preliminary work shows there is quite a knife-edge here: aggressive re-weighting causes the algorithm to fail to return similar paths (even if they really do exist in the evidence), while modest re-weighting causes the algorithm to converge on new paths slowly (or not at all).

The situation is further complicated by coverage. An “important” edge in the graph (i.e. one expected to be included in many of the actual genes) may have very little evidence, and aggressive re-weighting doesn’t afford the algorithm the opportunity to explore such branches before they are effectively pruned away. Any form of re-weighting must account for the fact that some edges are covered more than others, though it is unknown to us whether that is due to over-representation in the sample or because that edge really should appear as part of many paths.

My current strategy is triggered whenever a path has been recovered. For each edge in the newly extracted path (where an edge represents one SNP followed by another), the marginal distribution of the selected transition is inspected. Every selected edge is then reduced in proportion to the value of the lowest marginal: i.e. the least likely transition observed on the new path. Thus far this seems to strike a nice balance, but testing has been rather limited.
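
Concretely, a minimal sketch of that re-weighting step might look like the following (again illustrative rather than the actual implementation, and assuming the toy evidence structure from the earlier snippet):

```python
def reweight_path(evidence, path):
    """Damp the evidence for every edge on a newly recovered path.

    evidence: ((pos_a, allele_a), (pos_b, allele_b)) -> count (or weight),
              as built by the build_evidence sketch above.
    path: the ordered list of edges making up the recovered haplotype.
    """
    def marginal(edge):
        # this transition's share of all evidence leaving the same source allele
        source = edge[0]
        total = sum(count for e, count in evidence.items() if e[0] == source)
        return evidence[edge] / total if total else 0.0

    # the least likely transition observed on the new path...
    weakest = min(marginal(edge) for edge in path)
    # ...sets how hard every edge on that path is reduced
    for edge in path:
        evidence[edge] *= (1.0 - weakest)
    return evidence

ev = {((2, "G"), (4, "A")): 3, ((2, "G"), (4, "T")): 1, ((2, "A"), (4, "A")): 2}
# the chosen edge has marginal 3/4, so every path edge keeps 25% of its weight
print(reweight_path(ev, path=[((2, "G"), (4, "A"))]))
```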

What now?

  • Simplify the generation of evaluation data sets; currently this bottleneck is a pain in the ass and is holding up progress.
  • Standardise testing and keep track of results as part of a test suite instead of ad-hoc tweak-and-test.
  • Use multiple new curated simulated-real data sets to explore and optimise the algorithm’s behaviour.
  • Jury still out on edge re-weighting methodology.

In Other News

  • Publishing of first paper imminent!
  • Co-authored a research grant to acquire funding to test the results of the metahaplome recovery algorithm in a lab.
  • My PR to deprecate legacy samtools sort syntax was accepted for the 1.3 release and I got thanked on the twitters :’)
  • A couple of samtools odd-jobs, including a port of bamcheckR to samtools stats in the works…
  • sunblock still saving me hours of head-banging-on-desk time but not tidy enough to tell you about yet…
  • I’ll be attending the Microbiology Society Annual Conference in March. Say hello!

tl;dr

  • I’m still alive.
  • This stuff is quite hard which probably means it will be worth a PhD in the long run.
  • I am still bad at blog.

  1. Sorry not sorry. 
  2. The video is just short of twelve minutes, but YouTube’s analytics tell me the average viewer gives up after 5 minutes 56 seconds. Which is less than ten seconds after I mention the next segment of the talk will contain statistics. Boo. 
  3. That is, I haven’t come up with a catchy, concise and witty acronym for it yet. 
Meet the Metahaplome https://samnicholls.net/2016/01/21/meet-the-metahaplome/ https://samnicholls.net/2016/01/21/meet-the-metahaplome/#comments Thu, 21 Jan 2016 20:59:48 +0000 https://samnicholls.net/?p=549 Yesterday, I gave a talk at the Aberystwyth Bioinformatics Workshop on the metahaplome: a graph-inspired structure for encoding the variation of single nucleotide polymorphisms (SNPs) observed across aligned sequenced reads. The talk was unintentionally a lightning talk, after I realised I had more slides than time and that I was all that stood between delegates and the pub, but it seemed to provide a good introduction to some of my work so far:

As a semi-interesting aside, I activated the workout mode on my Fitbit shortly before heading up to the podium to deliver my talk. My heart rate reached a peak of 162 bpm and maintained an average of 126 bpm. I was called to the stage ~5 minutes into the “workout”; one can observe a rise and peak in heart rate, followed by a slow and gentle decrease as I became more comfortable during the talk and questions:

Fitbit Workout Graph during ABW2016 Talk

Status Report: October 2015 https://samnicholls.net/2015/11/01/status-oct15/ https://samnicholls.net/2015/11/01/status-oct15/#comments Sun, 01 Nov 2015 19:30:25 +0000 http://samnicholls.net/?p=302 As is customary with any blog that I attempt to keep, I’ve somewhat fallen behind in providing timely updates and am instead hoarding drafts in various states of readiness. This was not helped by my arguably ill-thought-out move to install WordPress and the rather painful migration that followed. Now that the dust has mostly settled, I figured it might be nice to outline what I am actually working on before inevitably publishing a new epic tale of computational disaster.

The bulk of my work falls under two main projects that should hopefully sound familiar to those who follow the blog:

Metagenomes

I’ve now entered the second year of my PhD at Aberystwyth University, following my recent fries-and-waffle-fuelled research adventure in Belgium. As a brief introduction for the uninitiated: I work in metagenomics, the study of all genetic sequences found in an environment. In particular, I’m interested in the metagenomes of microbial populations that have adapted to produce “interesting” enzymes (catalysts for chemical reactions). A few weeks ago, I presented a poster on the “metahaplome”1, the culmination of my first year of work: defining and formalising how variation in the sequences that produce these enzymes can be collected and organised.

DNA Quality Control

Over the summer, I returned to the Wellcome Trust Sanger Institute to continue some work I started as part of my undergraduate thesis. I’ve introduced the task previously and so will spare you the long-winded description, but the project initially stalled due to the significant time and effort required to prepare part of the data set. During my brief re-visit, I picked up where I left off with the aim of completing the data set. You may have read that I encountered several problems along the way, and that even when this mammoth task finally appeared complete, it was not. Shortly after I arrived in Leuven, the final execution of the sample improvement pipeline completed, and we’re now ready to move forward with the analysis.

 

Side Projects

As is inevitable when you give a PhD to somebody with a short attention span, I have begun to accumulate some side projects:

SAMTools

The Sequence Alignment and Mapping Tools2 suite is a hugely popular open-source bioinformatics toolkit for interacting with sequencing data. During my undergraduate thesis I contributed a naive header parser to a project fork, which improved the speed of merging large numbers of sequence files by several orders of magnitude. Recently, amongst a few small fixes here and there, I’ve added functionality to produce samtools stats output split by tags (such as @RG lines) and submitted a proposal to deprecate the legacy samtools sort usage. With some time over the upcoming holidays, I hope to finally contribute a proper header parser in time for samtools 1.4.
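
For context, the gist of that sort deprecation, as I recall the 1.3-era interface (consult samtools sort --help for the authoritative usage), is the move from a positional output prefix to an explicit output flag:

```
# Legacy usage (deprecated): the second positional argument is an output
# *prefix*, with ".bam" appended automatically
samtools sort aln.bam aln.sorted           # writes aln.sorted.bam

# Modern usage: an explicit output file, plus an optional prefix for
# temporary files
samtools sort -o aln.sorted.bam -T /tmp/aln.tmp aln.bam
```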

goldilocks

You may remember that I’d authored a Python package called goldilocks (YouTube: Goldilocks: Locating genomic regions that are “just right”, 1st RSG UK Symposium, Oct 2014) as part of my undergraduate work, to find a “just right” 1Mbp region of the human genome that was “representative” in terms of the variation expressed. Following some tidying and much optimisation, it’s now a proper, documented package, and I’m waiting to hear feedback on the submission of my first paper.

sunblock

You may have noticed my opinion on Sun Grid Engine, and the trouble I have had in using it at scale. To combat this, I’ve been working on a small side project called sunblock: a Python command line tool that encapsulates the submission and management of cluster jobs via a more user-friendly interface. The idea is to save anybody else from ever having to use Sun Grid Engine ever again. Thanks to a night in Belgium where it was far too warm to sleep, and a little Django magic, sunblock acquired a super-user-friendly interface and database backend.

Blog

This pain in the arse blog.


tl;dr

  • I’m still alive
  • I’m still working
  • Blogs are hard work

  1. Yes, sorry, it’s another -ome. I’m hoping it won’t find its way on to Jonathan Eisen’s list of #badomes
  2. Not to be confused with a series of tools invented by me, sadly. 