As is customary with any blog that I attempt to keep, I’ve somewhat fallen behind in providing timely updates and am instead hoarding drafts in various states of readiness. This was unhelped by my arguably ill thought out move to install WordPress and the rather painful migration that followed as a result. Now that the dust has mostly settled, I figured it might be nice to outline what I am actually working on before inevitably publishing a new epic tale of computational disaster.
The bulk of my work falls under two main projects that should hopefully sound familiar to those who follow the blog:
I’ve now entered the second year of my PhD at Aberystwyth University, following my recent fries-and-waffle-fueled research adventure in Belgium. As a brief introduction to the uninitiated, I work in metagenomics: the study of all genetic sequences found in an environment. In particular, I’m interested in the metagenomes of microbial populations that have adapted to produce “interesting” enzymes (catalysts for chemical reactions). A few weeks ago, I presented a poster on the “metahaplome“1 which is the culmination of my first year of work, to define and formalize how variation in sequences that produce these enzymes can be collected and organized.
DNA Quality Control
Over the summer, I returned to the Wellcome Trust Sanger Institute to continue some work I started as part of my undergraduate thesis. I’ve introduced the task previously and so will spare you the long winded description, but the project initially stalled due to the significant time and effort required to prepare part of the data set. During my brief re-visit, I picked up where I left off with the aim to complete the data set. You may have read that I encountered several problems along the way, and even when this mammoth task finally appeared complete, it was not. Shortly after arriving in Leuven, the final execution of the sample improvement pipeline was done. We’re ready to move forward with the analysis.
As is inevitable when you give a PhD to somebody with a short attention span, I have begun to accumulate some side projects:
The Sequence Alignment and Mapping Tools2 suite is a hugely popular open source bioinformatics tookit for interacting with sequencing data. During my undergraduate thesis I contributed a naive header parser to a project fork, that improved the speed of merges of large numbers of sequence files by several orders of magnitude. Recently, amongst a few small fixes here and there, I’ve added functionality to produce
samtools stats output split by tags (such as
@RG lines) and submitted a proposal to deprecate legacy
samtools sort usage. With some time over upcoming holidays, I hope to finally contribute a proper header parser in time for
You may remember that I’d authored a Python package called
goldilocks (YouTube: Goldilocks: Locating genomic regions that are “just right”, 1st RSG UK Symposium, Oct 2014) as part of my undergraduate work, to find a “just right” 1Mbp region of the human genome that was “representative” in terms of variation expressed. Following some tidying and much optimisation, it’s now a proper package, documented, and I’m now waiting to hear feedback on the submission of my first paper.
You may have noticed my opinion on Sun Grid Engine, and the trouble I have had in using it at scale. To combat this, I’ve been working on a small side project called
sunblock: a Python command line tool that encapsulates the submission and management of cluster jobs via a more user-friendly interface. The idea is to save anybody else from ever having to use Sun Grid Engine ever again. Thanks to a night in Belgium where it was far too warm to sleep, and a little Django magic,
sunblock acquired a super-user-friendly interface and database backend.
This pain in the arse blog.
- I’m still alive
- I’m still working
- Blogs are hard work
- Yes, sorry, it’s another -ome. I’m hoping it won’t find its way on to Jonathan Eisen’s list of #badomes. ↩
- Not to be confused with a series of tools invented by me, sadly. ↩