Samposium | The Exciting Adventures of Sam | https://samnicholls.net

Duplicati: not a valid Win32 FileTime
Mon, 10 Jan 2022 | https://samnicholls.net/2022/01/10/duplicati-not-a-valid-win32-filetime/

I use Duplicati to back up parts of my Windows machine to a server in my house over the samba protocol. I diagnosed the error by opening the "Live" log, starting the broken backup again and observing that the error is thrown after the Backend event: List log entry. From the server I listed the target backup directory with ls -lutr (list, using access time, sort and reverse), and immediately noticed several blocks from the last successful backup had an access time from the year 30828! I touched the affected files to give them a more sensible access time and the backup from my Windows machine was able to complete successfully. If I ever discover how this happened in the first place, I will update this post. In the meantime, I hope this is helpful to someone else, as searching for help on the issue only turned up old problems with local file backups.

Quick fix for Crashplan Linux segfault
Mon, 20 May 2019 | https://samnicholls.net/2019/05/20/quick-fix-for-crashplan-linux-segfault/

If you use Crashplan as your backup solution, it's likely because it is one of the only companies that make backing up from Linux straightforward (even if they did shaft all their home customers and force them onto business accounts). However, part of the Linux Crashplan experience appears to be discovering, from time to time, that the backup engine has stopped working and the application segfaults on start. The segfault is caused by updating glibc, which as a good administrator you are probably doing by occasionally running an update on your package manager du jour. The update borks a standard compiled library that the Crashplan electron app needs, namely libnode.so. It was a pain to work this out because the client's logging is pretty rubbish. Luckily, the fix turns out to be pretty simple. Just install an electron app that isn't a pile of garbage, like GitHub's Atom editor, then copy its libnode.so to your Crashplan directory:
cd /usr/local/crashplan/electron
sudo mv libnode.so libnode.so.bork
sudo cp /usr/share/atom/libnode.so .
Of course, your locations may vary, but this will allow the application to start.
Status Report: 2018: The light is at the end of the tunnel that I continue to build
Mon, 15 Jan 2018 | https://samnicholls.net/2018/01/15/status-jan18-p1/

Happy New Year!
The guilt of not writing has reached a level where I feel sufficiently obligated to draft a post. You’ll likely notice from the upcoming contents that I am still a PhD student, despite a previous, more optimistic version of myself writing that 2016 would be my final Christmas as a PhD candidate.

Much has happened since my previous Status Report, and I'm sure much of it will spin off into several posts of its own, eventually. For the sake of brevity, I'll give a high-level overview.
I’m supposed to be writing a thesis anyway.


Previously on…

We last parted ways with a double-bill status report lamenting the troubles of generating suitable test data for my metagenomic haplotype recovery algorithm, and documenting the ups-and-downs-and-ups-again of analysing one of the synthetic data sets for my pre-print. In particular, I was on a quest to respond to our reviewers' desire for more realistic data: real reads.

Gretel: Now with real reads!

Part Two of my previous report alluded to a Part Three that I never got around to finishing, on the creation and analysis of a test data set consisting of real reads. This was a major concern of the reviewers who gave feedback on our initial pre-print. Without getting into too much detail (I'm sure there's time for that), I found a suitable data set consisting of real sequence reads from a lab-mix of five HIV strains, used to benchmark algorithms in the related problem of viral quasispecies reconstruction. After fixing a small bug, and implementing deletion handling, it turns out we do well on this difficult problem. Very well.

In the same fashion as our synthetic DHFR metahaplome, this HIV data set provided five known haplotypes, representing five different HIV-1 strains. Importantly, we were also provided with real Illumina short-reads from a sequencing run containing a mix of the five known strains. This was our holy grail, finally: a benchmark with sequence reads and a set of known haplotypes. Gretel is capable of recovering long, highly variable genes with 100% accuracy. My favourite result is a recovery of env — the ridiculously hyper-variable envelope gene that encodes the HIV-1 virus’ protein shell — with Gretel correctly recovering all but one of 2,568 positions. Not bad.

A new pre-print

Armed with real reads, and improved results for our original DHFR test data (thanks to some fiddling with bowtie2), we released a new pre-print. The manuscript was a substantial improvement over its predecessor, which made it all the more disappointing to be rejected from five different journals. But more on this misery at another time.

Despite our best efforts to address the previous concerns, new reviewers felt that our data sets were still not a good representation of the problem at hand: "Where is the metagenome?". It felt like the goalposts had moved: suddenly, real reads were not enough. It is both a frustrating and a fair response: work should be empirically validated, but there are no metagenomic data sets with both a set of sequence reads and a set of known haplotypes. So, it was time to make one.

I’m a real scientist now…

And so, I embarked upon what would become the most exciting and frustrating adventure of my PhD. My first experience of the lab as a computational biologist is a post still sat in draft, but suffice to say that the learning curve was steep. I've discovered that there are many different types of water and that they all look the same, that 1ml is a gigantic volume, that you'll lose your fingerprints if you touch a metal drawer inside a -80C freezer, and that, contrary to what I might have thought before, transferring tiny volumes of colourless liquids between tiny tubes without fucking up a single thing takes a lot of time, effort and skill. I have a new appreciation for the intricate and stochastic nature of lab work, and I understand what it's like for someone to "borrow" a reagent that you spent hours of your time making from scratch. And finally, I had a legitimate reason to wear an ill-fitting lab coat that I purchased in my first year (2010) to look cool at computer science socials.

With this new-found skill-tree to work on, I felt like I was becoming a proper interdisciplinary scientist, but this comes at a cost. Context switching isn’t cheap, and I was reminded of my undergraduate days where I juggled mathematics, statistics and computing to earn my joint honours degree. I had more lectures, more assignments and more exams than my peers, but this was and still is the cost of my decision to become an interdisciplinary scientist.

And it was often difficult to find much sympathy from either side of the Venn diagram…

..and science can be awful

I've suffered many frustrations as a programmer. One can waste hours tracking down a bug that turns out to be a simple typo, or more likely, one of the off-by-one errors that plague much of bioinformatics. I've felt the self-directed anger of having submitted thousands of cluster jobs that failed with a missing parameter, or of waiting hours for a program to complete, only to discover the disk has run out of room to store the output. Yet these problems pale in comparison to problems at the bench.

I've spent days in the lab, setting up and executing PCR, casting, loading and running gels, only to take a UV image of absolutely nothing at all.

Last year, I spent most of Christmas shepherding data through our cluster, much to my family's dismay. This year, I had to miss a large family do for a sister's milestone birthday. I spent many midnights in the lab, lamenting the life of a PhD student, and shuffling around with angry optimism: "Surely it has to fucking work this time?". Until finally, I got what I wanted.

I screamed so loud with glee that security came to check on me. “I’m a fucking scientist now!”

New Nanopore Toys

My experiment was simple in principle. Computationally, I'd predicted haplotypes with my Gretel method from short-read Illumina data from a real rumen microbiome. I designed 10 pairs of primers to capture 10 genes of interest (with hydrolytic activity) using the haplotypes. And finally, after several weeks of constant, almost-24/7 lab work, building cDNA libraries and amplifying the genes of interest, I made enough product for the exciting next step: Nanopore sequencing.

With some invaluable assistance from our resident Nanopore expert Arwyn Edwards (@arwynedwards) and PhD student André (@GeoMicroSoares), I sequenced my amplicons on an Oxford Nanopore MinION, and the results were incredible.

Our Nanopore reads strongly supported our haplotypes, and concurred with the Sanger sequencing. Finally, we have empirical biological evidence that Gretel works.

The pre-print rises

With this bombshell in the bag, the third version of my pre-print rose from the ashes of the second. We demoted the DHFR and HIV-1 data sets to the Supplement, and in their place included an analysis of our performance on a de facto benchmark mock community introduced by Chris Quince. The data sets and evaluation mechanisms that our previous reviewers found unrepresentative and convoluted were gone. I even got to include a Circos plot.

Once more, we substantially updated the manuscript, and released a new pre-print. We made our way to bioRxiv to much Twitter fanfare, earning over 1,500 views in our first week.

This work also addresses every piece of feedback we’ve had from reviewers in the past. Surely, the publishing process would now finally recognise our work and send us out for review, right?

Sadly, the journey of this work is still not smooth sailing, with three of my weekends marred by a Friday desk rejection…

…and a fourth desk rejection on the last working day before Christmas was pretty painful. But we are currently grateful to be in discussion with an editor and I am trying to remain hopeful we will get where we want to be in the end. Wish us luck!


In other news…

Of course, I am one for procrastination, and have been keeping busy while all this has been unfolding…

I hosted a national student conference

I am applying for some fellowships

I’ve officially started my thesis…

…which is just as well, because the money is gone

I’ve started making cheap lab tat with my best friend…

…it’s approved by polar bears

…and the UK Centre for Astrobiology

…and has been to the Arctic

I gave an invited talk at a big conference…

…it seemed to go down well

I hosted UKIEPC at Aber for the 4th year

We’ve applied to fund Monster Lab…

…and made a website to catalogue our monsters

For a change I chose my family over my PhD and had a fucking great Christmas


What’s next?

  • Get this fucking great paper off my desk and out of my life
  • Hopefully get invited to some fellowship interviews
  • Continue making cool stuff with Sam and Tom Industrys
  • Do more cool stuff with Monster Lab
  • Finish this fucking thesis so I can finally do something else

tl;dr

  • Happy New Year
  • For more information, please re-read
We made a Lego DNA Sequencer to teach kids about DNA, Sequencing and Phenotypes
Wed, 15 Mar 2017 | https://samnicholls.net/2017/03/15/lego-sequencer/

Every year the university is host to over a thousand primary school pupils as part of British Science Week. Last year you may remember I ran an activity that taught you how to be a sequence aligner through the medium of Lego. Describing what I'd like to do differently in future, I included the following:

To improve, my idea would be to get participants to construct a short genome out of Lego pieces that can be truly “sequenced” by pushing it through some sort of colour sensor or camera apparatus attached to an Arduino inside a future iteration of the trusty SAMTECH Sequencer range.

So I’d like to introduce the Sam and Tom Industrys Legogen Sequenceer(TM) 9001:

The Sequencer

The Lego "Sequenceer" has a tray that holds a small stack of Lego bricks (up to 12). Once loaded, a toothed side of the tray is pulled along a track by a gear attached to a stepper motor that was commandeered from an old Epson printer. An RGB colour sensor sits just over the track on a small acrylic bridge, in very close proximity to the passing bricks. The stepper is programmed to actuate a calibrated number of steps per brick, stopping to allow the RGB sensor to take a reading of a brick's colour before moving to the next.

Our software then translates these readings to one of the four DNA bases and prints the result on the serial port for the user to see. Once complete, the stepper motor is reversed and the tray is "ejected" such that it protrudes once more out of the front of the machine, allowing for unloading and loading of the next Lego brick DNA sequence.
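
For illustration, the base-calling step is essentially nearest-colour classification. The sketch below is a minimal Python mock-up of that logic, not the C++ running on the Arduino itself; the reference RGB values and three of the four colour-to-base assignments are invented for the example (we only know that blue corresponds to T, as mentioned later in this post).

# Minimal Python mock-up of the Sequenceer's base-calling step.
# Reference colours and most of the colour-to-base mapping are assumptions
# for illustration (blue -> T is taken from later in this post).
REFERENCE_BRICKS = {
    "A": (200, 40, 40),   # red (assumed)
    "C": (230, 210, 50),  # yellow (assumed)
    "G": (40, 160, 60),   # green (assumed)
    "T": (40, 70, 200),   # blue
}

def call_base(rgb):
    # Return the base whose reference colour is nearest (squared distance) to the reading
    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(REFERENCE_BRICKS, key=lambda base: dist(rgb, REFERENCE_BRICKS[base]))

# Ten raw RGB readings from the colour sensor, one per brick
readings = [(38, 72, 195), (205, 45, 38), (42, 158, 66), (225, 205, 55), (41, 69, 210),
            (198, 44, 41), (39, 161, 58), (44, 66, 203), (228, 212, 48), (40, 159, 61)]
print("".join(call_base(r) for r in readings))  # TAGCTAGTCG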

The Activity

My previous activity tasked pupils with aligning pre-constructed Lego DNA sequences to each other. The goal was to introduce the idea of short read sequencing and how alignment of a short sequence is hard, let alone against an unexpectedly long human genome (the highest guesses for its length in base pairs were in the thousands). In general, the students enjoyed it, but I felt I could massively improve the task if we used Lego for its intended purpose: construction.

This year, after quizzing pupils on what they know about DNA, I once again introduce the concept of using 2×2 blocks of Lego as DNA bases. We have four colours, matching the four bases of DNA we see in our own genomes. I explain that today we will be constructing a 10 base DNA sequence, representing the entire genome of a monster, and that we can find out what that genome is with our sequencer.

I designed some official-looking forms onto which the pupils could copy out the base calls for each of the ten bases of their monster's genome, each of which happens to be a gene that controls an interesting phenotype:

Our young scientists then look at our "monster datasheet", which decodes each of the genes and what the phenotype for each of the alleles will be:

Now given their sequenced monster genome, and the datasheet, we ask the pupils to make their monster’s phenotype a reality – draw their monster to add to our records! The hardest part? Naming their newly discovered monster. Finally they add their name and school, to credit them with their discovery. Here was our first of the day. I stole our department’s stamp for an additional feeling of official-ness:

Conclusion

I’m so pleased with how this project has turned out. Not only is the sequencer itself the coolest electronics project I’ve had a hand in, but the reaction to the stand on the day from children and adults alike was so positive. The pupils loved the concept: getting to build something with Lego, and then drawing a cool monster that has special powers is clearly a recipe for success. It is rewarding to see that they’ve enjoyed the activity, as well as to know they can now tell you something about DNA and phenotypes too.

This is a significant improvement on my stand from last year. We get to look at bases and DNA, sequencing and, importantly, how changes in bases correspond to changes in a phenotype: the thing that makes us who we are. The activity tells much more of a story, rather than just aligning Lego bricks to explain that a hard problem is indeed hard. Here, we get to physically construct a tangible proxy for a genome and find out something about it through an experiment. Finally, we add the result to the sum of knowledge on the subject of monsters discovered so far. My wall is now full of unique and interesting monster records, each drawn with some artistic license by the scientist in charge. Though, despite the large sequence space, I was amused to note the bias towards 8-legged and 8-eyed monsters, due to an environmental pressure for the selection of the T nucleotide: the colour blue.

Once again, I’ve had much more fun than my regular grumpy-self was expecting. I’ve been reminded that public engagement can be very rewarding.

Engage engagement

Some pictures of the stand from the day:

Finally, here are the images from our inaugural Monster Lab:

Monster Lab: Aber Uni Science Week 2017

tl;dr

  • Tom and I upgraded the old SAMTECH 9000 to create the Arduino-powered self-scanning Legogen Sequenceer 9001
  • Spending the entire day with children continues to be absolutely exhausting
  • Public engagement continues to be rewarding
  • Lego is an excellent way to introduce the concept of DNA, and genomes
  • Blue is a particularly popular favourite colour
  • Primary school children aren’t as dumb as everyone says they are
  • Packs of genuine 2×2 Lego bricks are really bloody expensive
Building a Beehive Monitor: A Prototype
Tue, 28 Feb 2017 | https://samnicholls.net/2017/02/28/beehive-monitor-p1/

Ever since I got my own beehive last summer, I've been toying with the idea of cramming it full of interesting sensors to report on the status of my precious hive. During winter it is too cold to perform inspections, and since my final inspection last October I've grown increasingly curious and anxious as to whether the bees had survived. Last week, I finally got a spell of reasonable weather and I'm happy to report that the bees are doing well.

It has been some time since I'd last seen inside the hive. During and following the inspection, I was once again inspired to find a way to monitor the bees' progress. Although this is primarily self-interest, as I find the hive fascinating, monitoring a hive both poses some interesting technical problems and provides a platform on which to actually quantify the hive's status. Such information could become much more valuable in future when we have more hives, as it would be possible to establish a baseline from which to detect anomalies early. But again, mostly because beehives are cool.

Proof of Concept Monitor and Transmitter

I'd been discussing these ideas with Tom, who you might remember from the lightstick project as the half of the newly created Sam and Tom Industrys venture that is actually capable of making things with electronics.

We began assembling a quick proof of concept. Handily, Tom had a bunch of DHT22 temperature and humidity sensors lying around which provided a good start. Regarding a microcontroller, I’ve wanted an excuse to play with radio communication for some time, and this was a nice opportunity to try out a pair of Moteino R4s. We also discussed a future idea to use Bluetooth because the idea of pairing my phone with a beehive during an inspection seemed both amusing, and a useful way to download data that is less practical to transmit over radio to a device.

We set to work. Tom retrofitted some code he had previously written to manage wireless comms between Moteinos, while I cut and badly stripped some wires. After some breadboard battling, we had a working prototype in a quarter of an hour. Not content with this, we attached the sensors to lengths of flat cable, and added a TEMT6000 ambient light sensor and a NeoPixel ring for no particular reason other than the fact we love NeoPixels so much.

Base Station Prototype Receiver

We verified that the packets containing the sensor data were being broadcast. The Moteino queries the sensors and loads the results into a struct, which is then transmitted over radio as binary by the RFM69 library. Now we needed a base station to pick up the transmissions and do something interesting.

We already had packet handling and receiving code for the second Moteino, but as we’d decided (for the time being at least) that the endpoint for the sensor data was to be hosted on the internet, we needed WiFi. Enter our WiFi enabled friend: the ESP8266.

We needed to establish a software serial link between the Moteino and the ESP8266, a trivial use-case that can be solved by the Arduino SoftwareSerial library. I did encounter some familiar confusion arising from our trusty pinout diagram: connecting the marked RX and TX pins on the NodeMCU to the Moteino interfered with my ability to actually program it. We instead set the RX/TX pins to the unused D1 and D2 respectively.

The receiving Moteino captures each transmitted packet, reads the first byte to determine its type (we’ve defined only one, but it’s good to future proof early) and if it is the right size, passes the binary data through its serial interface to the ESP8266 before returning an acknowledgement to the transmitting Moteino.

The NodeMCU uses the ESP8266WiFi library to establish a connection to the SSID defined in our configuration file. The ESP8266 Arduino repository serves as a helpful reference to the source, and typically features examples for each library.

Once a connection to the designated SSID has been negotiated, the controller listens to the software serial receive pin for data. When data is available, the first byte is read (as before on the Moteinos) to determine the packet type. If the packet type is valid, we listen for as many bytes as necessary to fill the struct for that particular packet type. We ran into a lot of head scratching here, eventually discovering that our packet structs were larger than calculated, as uint8_t's are padded with 3 leading bytes to align them to the architecture of the NodeMCU. This caused us almost an hour of debugging many off-by-some-bytes oddities in our packet handling.
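
The size surprise is easy to reproduce away from the hardware. As a sketch (this is not the real Telehive packet layout, just a toy one), Python's struct module shows the difference between a naturally aligned layout and a packed one: a uint8_t followed by a 32-bit field picks up three padding bytes under native alignment.

import struct

# Toy packet: a uint8_t "type" byte followed by a 32-bit unsigned reading.
# Illustrative only; not the actual packet struct used in this project.
NATIVE = "@BI"  # native byte order with native alignment (padding inserted)
PACKED = "=BI"  # standard sizes, no alignment padding

print(struct.calcsize(NATIVE))  # 8: 1 byte + 3 padding bytes + 4 bytes
print(struct.calcsize(PACKED))  # 5: exactly the bytes you meant to send

Checking sizeof on the microcontrollers at both ends of the link (or calcsize, as above) is a cheap way to catch this kind of mismatch early.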

Once a valid packet’s binary stream has been loaded into the appropriate struct, we translate it to a JSON string via the rather helpful ArduinoJSON library. As the ESP8266HTTPClient requires a char* payload, I wasn’t sure of the correct way to actually transmit the JSONObject, so I rather unpleasantly abused its printTo function to write it into a fixed size char array:

[...]
    StaticJsonBuffer<256> jsonBuffer;
    JsonObject& json = jsonBuffer.createObject();

    jsonifyPayload(json, payload, pkt_type);
        
    http.begin(endpoint, fingerprint);
    http.addHeader("Content-Type", "application/json");

    char msg[256];
    json.printTo(msg);
    Serial.print(msg);

    int httpCode = http.POST(msg);
[...]

Note the fingerprint in the http.begin: a required parameter containing the SHA1 fingerprint of the endpoint address if you are attempting to use HTTPS. Without this, the library will just refuse to make the connection. It would have been trivial to diagnose this earlier if we could actually work out how to properly enable debug messages from the ESP8266HTTPClient library. Despite our best attempts, we had to force the debug mechanism to work by manually editing away the guards around the #define DEBUG_HTTPCLIENT header, in the installed packages directory. No idea why. Grim.

More time was wasted on a lot of tail chasing after what appeared to be the delivery of a successful POST request returned an httpCode of -5. It would later turn out that the Python SimpleHTTPServer example I lazily downloaded from the web failed to set the headers of the response to the POST request, causing the ESP8266HTTPClient to believe the connection to the server was lost (an all-else-fails assumption it makes if it is unable to parse a response from the server). The -5 refers to a definition in the library header: #define HTTPC_ERROR_CONNECTION_LOST (-5).
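
For anyone following along, the fix on the test-server side is simply to send a complete response to the POST. Below is a minimal Python 3 sketch of such an endpoint; it is not the exact server we used (that was a Python 2 SimpleHTTPServer example), and the port is arbitrary.

from http.server import BaseHTTPRequestHandler, HTTPServer

class PostHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        print(self.rfile.read(length).decode("utf-8", "replace"))  # dump the JSON payload

        # The part our original test server skipped: a proper status line and headers.
        # Without these, ESP8266HTTPClient assumes the connection was lost and returns -5.
        self.send_response(200)
        self.send_header("Content-Type", "text/plain")
        self.send_header("Content-Length", "2")
        self.end_headers()
        self.wfile.write(b"OK")

if __name__ == "__main__":
    HTTPServer(("0.0.0.0", 8000), PostHandler).serve_forever()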

Finally, we received HTTP_CODE_OK responses back from our test server, which happily dumped our JSON payload to the terminal on my laptop… after I remembered to join the WiFi network we'd connected the NodeMCU to, at least.

The Web Interface: Back from the Dead

We have established end-to-end communications and confirmed that sensor data can find its way to a test server hosted on the local network. The final piece of the puzzle was to throw up a website designed to receive, store and present the data. Somewhat fortuitously, a project that I had begun and abandoned in 2014 seemed to fit the bill: Poisson, a flask-fronted, redis-backed, schema-less monitoring system. Despite not looking at it in almost three years, I felt there might be some mileage in revising it for our purpose.

Initially, Poisson was designed to monitor the occurrence of arbitrary events, referred to by a key. A GET request with a key would cause Poisson to add a timestamped entry to the redis database for that key. For example, one could count occurrences of passengers, customers, sales or other time sensitive events by emitting GET requests with a shared key to the Poisson server. The goal was to generate interesting Poisson distribution statistics from the timestamped data, but I never got as far as the fun mathematics. The front end made use of the hip-at-the-time socket.io framework, capable of pushing newly detected events in real time to connected clients, and drawing some pretty pretty graphs using d3 and cubism.

After some quick tweaks that allowed requests to also specify a value for the latest event (instead of assuming 1) and provided an API endpoint for POST requests that wrapped the mechanism of the original GET for each key:value pair in the JSON data, we were ready to go.
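
The endpoint itself amounts to only a few lines of Flask. The snippet below is a rough sketch of the idea rather than the actual Poisson code; the route name and the redis key layout are invented for illustration.

import time

import redis
from flask import Flask, jsonify, request

app = Flask(__name__)
r = redis.Redis()

@app.route("/api/report", methods=["POST"])
def report():
    # Each key:value pair in the JSON body becomes a timestamped event,
    # reusing the same mechanism as the original per-key GET endpoint.
    now = time.time()
    payload = request.get_json(force=True)
    for key, value in payload.items():
        r.rpush("events:%s" % key, "%f:%s" % (now, value))
    return jsonify({"ok": True, "stored": len(payload)})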

Until I discovered all this newfangled websocket nonsense didn’t seem to play well with apache in practice. After much frustration, the best solution appeared to involve installing gevent with pip on the server, which according to the flask-socketio documentation for deployment would cause flask-socketio to replace the development server (typically inappropriate for production use) with an embedded server capable of working in prod. I could then invoke the application myself to bind it to some local port. After some fiddling, I found the following two lines in my apache config were all that was needed to correctly proxy the locally-run Poisson application to my existing webserver.

ProxyPass http://localhost:5000/
ProxyPassReverse  http://localhost:5000/

Testing the Prototypes (not in a hive)

Clearly we’re not hive ready. We’ve planted the current version of the hardware at Tom’s house (to avoid having to deal with getting the ESP8266 to work with eduroam‘s 802.1x authentication) and had the server running long enough to collect over 100,000 events and iron out any small issues. We’re currently polling approximately every 30 seconds, and the corresponding deployment of the Poisson interface is configured to show over 24 hours of data at a resolution of 2 minutes, and the past 90 minutes at a resolution of 30 seconds.

All in all I’m pretty pleased with what we’ve managed to accomplish in around two days of half baked work, and the interface is probably the shiniest piece of web I’ve had the displeasure of making since Triage.

Conclusion

We've collected over 100,000 events with little issue. I've ordered some additional sensors of interest: on the way are some sound sensors and infrared sensors. The latter I shall be using in an attempt to count (or at least monitor) bee in/out activity at the hive entrance.

It won't be until nearer the middle-to-end of March that the weather will likely pick up enough to mess around with the hive at length. Though we must bee-proof the sensors and construct a weatherproof box for the transmitting Moteino before then. You can also look at the microcontroller code in our Telehive repository, and the Poisson flask application repository.

In the meantime, go spy on Tom’s flat live.

tl;dr

  • Moteinos are awesome, low power radio communications are cool
  • The Arduino IDE can be a monumental pain in the ass sometimes
  • We had to manually uncomment lines in the headers for the ESP8266HTTPClient library to enable debugging, we still cannot fathom why
  • You need to specify the SHA1 fingerprint of your remote server as a second parameter to http.begin if you want to use HTTPS with ESP8266HTTPClient
  • Your crap Python test server should set the headers for its response, else your ESP8266HTTPClient connection will think it has failed
  • ESP8266HTTPClient uses its own response codes that are #define‘d in its headers, not regular HTTP codes
  • uint8_t's are padded with 3 bytes to align them on the NodeMCU ESP8266 architecture
  • This kind of stuff continues to be fraught with frustration, but is rather rewarding when it all comes together
Laser Etching my XPS 13 Laptop: Orphan Black style
Mon, 30 Jan 2017 | https://samnicholls.net/2017/01/30/laptop-laser-engrave/

Earlier today I got pretty cross with bioinformatics and our current state of tooling:

In an attempt to make myself feel better, I decided to stick my laptop in a laser cutter to give it a makeover [1]. Here is the exciting adventure-cross-how-to:

Pew pew, I stuck my laptop in a laser cutter

I had faith that this wouldn't be a disaster (despite my screaming while pressing the Start button), but I'm actually pretty surprised by the quality of the result, which was beyond my expectations. Tom and I plan to investigate using the laser for engraving all sorts of other odds and ends in future.


  1. You should recognise the pattern if you’re an Orphan Black fan! Cosima and I can be laptop buddies. 
Status Report: November 2016 (Part II): Revisiting the Synthetic Metahaplomes
Sat, 24 Dec 2016 | https://samnicholls.net/2016/12/24/status-nov16-p2/

In the opening act of this status report, I described the abrupt and unfortunate end to the adventure of my pre-print. In response to reviewer feedback, I outlined three major tasks that lie in wait for me, blocking the path to an enjoyable Christmas holiday and a better manuscript submission to a more specialised, alternative journal:

  • Improve Triviomes
    We are already doing something interesting and novel, but the “triviomes” are evidently convoluting the explanation. We need something with more biological grounding such that we don’t need to spend many paragraphs explaining why we’ve made certain simplifications, or cause readers to question why we are doing things in a particular way. Note this new method will still need to give us a controlled environment to test the limitations of Hansel and Gretel.
  • Polish DHFR and AIMP1 analysis
    One of our reviewers misinterpreted some of the results, and drew a negative conclusion about Gretel's overall accuracy. I'd like to revisit the DHFR and AIMP1 data sets both to improve the story we tell, and to describe in more detail (with more experiments) under what conditions we can and cannot recover haplotypes accurately.

  • Real Reads
    Create and analyse a data set consisting of real reads.

The first part of this report covered my vented frustrations with generating, and particularly analysing, the new "Treeviomes" in depth. This part will now turn its focus to the sequel of my already existing DHFR analysis, addressing the reasoning behind a particularly crap set of recoveries, and reassuring myself that perhaps everything is not a disaster after all.

Friends and family, you might be happy to know that I am less grumpy than last week, as a result of the work described in this post. You can skip to the tldr now.

DHFR II: Electric Boogaloo

Flashback

Our pre-print builds upon the initial naive testing of Hansel and Gretel on the recovery of randomly generated haplotypes, by introducing a pair of metahaplomes that each contain five haplotypes of varying similarity for a particular gene: namely DHFR and AIMP1.

Testing follows the same formula as we’ve discussed in a previous blog post:

  • Select a master gene (an arbitrary, DHFR, in this case)
  • megaBLAST for similar sequences and select 5 arbitrary sequences of decreasing identity
  • Extract the overlapping regions on those sequences, these are the input haplotypes
  • Generate a random set of (now properly) uniformly distributed reads, of a fixed length and coverage, from each of the input haplotypes (see the sketch after this list)
  • Use the master as a pseudo-reference against which to align your synthetic reads
  • Stand idly by as many of your reads are hopelessly discarded by your favourite aligner
  • Call for variants on what is left of your aligned reads by assuming any heterogeneous site is a variant
  • Feed the aligned reads and called variant positions to Gretel
  • Compare the recovered haplotypes, to the input haplotypes, with BLAST
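
The read-generation step above is simple enough to sketch. For a target average coverage C and fixed read length L over a haplotype of length G, we need roughly C * G / L reads with uniformly drawn start positions. This is a simplification of the actual generator, for illustration only.

import random

def uniform_reads(haplotype, read_length, coverage):
    # Number of reads such that total sequenced bases / haplotype length = coverage
    n_reads = int(round(coverage * len(haplotype) / float(read_length)))
    reads = []
    for _ in range(n_reads):
        start = random.randint(0, len(haplotype) - read_length)  # uniform start position
        reads.append(haplotype[start:start + read_length])
    return reads

# e.g. 25x coverage of a toy 300bp haplotype with 100bp reads -> 75 reads
haplotype = "".join(random.choice("ACGT") for _ in range(300))
print(len(uniform_reads(haplotype, 100, 25)))  # 75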

As described in our paper, results for both the DHFR and AIMP1 data sets were promising, with up to 99% of SNPs correctly recovered to successfully reconstruct the input haplotypes. We have shown that haplotypes can be recovered from short reads that originate from a metagenomic data set. However, I want to revisit this analysis to improve the story and fix a few outstanding issues:

  • A reviewer assumed that lower recovery rates on haplotypes with poorer identity to the pseudo-reference were purely down to bad decision making by Gretel, rather than anything to do with the long paragraph about how this is caused by discarded reads
  • A reviewer described the tabular results as hard to follow
  • I felt I had not provided readers with a deep enough analysis of the effects of read length and coverage; although no reviewer asked for this, it will help me sleep at night
  • Along the same theme, the tables in the pre-print only provided averages, whereas I felt a diagram might better explain some of the variance observed in recovery rates
  • Additionally, I'd updated the read generator as part of my work on the "Treeviomes" discussed in my last post, particularly to generate more realistic looking distributions of reads. I needed to check results were still reliable, and thought it would be helpful if we had a consistent story between our data sets

Progress

Continuing the rampant pessimism that can be found in the first half of this status report, I wanted to audit my work on these “synthetic metahaplomes”, particularly given the changed results between the Triviomes, and more biologically realistic and tangible Treeviomes. I wanted to be sure that we had not departed from the good results introduced in the DHFR and AIMP1 sections of the paper.

To be a little less gloomy, I was also inspired by the progress I had made on the analysis of the Treeviomes. I felt we could try to present our results uniformly and use box diagrams similar to those that I had finally managed to design and plot at the end of my last post. I feel these diagrams provide a much more detailed insight into the capabilities of Gretel given attributes of read sets such as length and coverage. Additionally, we know each of the input haplotypes, which, unlike the uniformly varying sequences of our Treeviomes, have decreasing similarity to the pseudo-reference; thus we can use such a plot to describe Gretel's accuracy with respect to each haplotype's distance from the pseudo-reference (somewhat akin to how we present results for different values of per-haplotype, per-base sequence mutation rates on our Treeviomes).

So here’s one I made earlier:

Sam, what the fuck am I looking at?

  • Vertical facets are read length
  • Horizontal facets represent each of the five input haplotypes, ordered by decreasing similarity to the chosen DHFR pseudo-reference
  • X-axis of each boxplot is the average read coverage for each of the ten generated read sets
  • Y-axis is the average best recovery rate for a given haplotype, over the ten runs of Gretel: each boxplot summarising the best recoveries for ten randomly generated, uniformly distributed sets of reads (with a set read length and coverage)
  • Top scatter plot shows number of called variants across the ten sets of reads for each length-coverage parameter pair
  • Bottom scatter plot shows for each of the ten read sets, the proportion of reads from each haplotype that were dropped during alignment

What can we see?

  • Everything is not awful: three of the five haplotypes can be recovered to around 95% accuracy, even with very short reads (75-100bp), and reasonable average per-haplotype coverage.
  • Good recovery of AK232978, despite a 10% mutation rate when compared to the pseudo-reference: Gretel yielded haplotypes with 80% accuracy.
  • Recovery of XM_012113510 is particularly difficult. Even given long reads (150-250bp) and high coverage (50x) we fail to reach 75% accuracy. Our pre-print hypothesised that this was due to its 82.9% identity to the pseudo-reference causing dropped reads.

Our pre-print presents results considering just one read set for each of the length and coverage parameter pairs. Despite the need to perhaps take those results with a pinch of salt, it is somewhat relieving to see that our new more robust experiment yields very similar recovery rates across each of the haplotypes. With the exception of XM_012113510.

The Fall of “XM”

Recovery rates for our synthetic metahaplomes, such as DHFR, are found by BLASTing the haplotypes returned by Gretel against the five known, input haplotypes. The “best haplotype” for an input is defined as the output haplotype with the fewest mismatches against the given input. We report the recovery rate by assuming the mismatches are the incorrectly recovered variants, and divide this by the number of variants called over the read set that yielded this haplotype.

We cannot simply use sequence identity as a score: consider a sequence of 100bp; if there are just 5 variants, all of which are incorrectly reconstructed, Gretel will still appear to be 95% accurate, rather than correctly reporting 0% (1 – 5/5). Note, however, that our method does somewhat assume that variants are called perfectly: remember that homogeneous sites are ignored by Gretel, so any "missed" SNPs will always mismatch the input haplotypes, as we "fill in" homogeneous sites with the nucleotides from the pseudo-reference when outputting the haplotype FASTA.
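
To make the scoring concrete, here is the calculation as described above, in a few lines of Python. This is a sketch of the idea, not the actual evaluation harness.

def recovery_rate(n_called_variants, n_mismatches):
    # Proportion of called variant sites that were reconstructed correctly
    return 1.0 - (float(n_mismatches) / n_called_variants)

def sequence_identity(seq_length, n_mismatches):
    # The misleading alternative: identity over the whole sequence
    return 1.0 - (float(n_mismatches) / seq_length)

# The 100bp example: 5 variant sites, all 5 reconstructed incorrectly
print(sequence_identity(100, 5))  # 0.95 -- looks respectable
print(recovery_rate(5, 5))        # 0.0  -- the honest answer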

XM_012113510 posed an interesting problem for evaluation. When collating results for the pre-print, I found the corresponding best haplotype failed to align across the entirety of the gene (564bp) and instead yielded two high-scoring, non-overlapping hits. When I updated the harness that calculates and collates recovery rates across all the different runs of Gretel, I overlooked the possibility of this and forced selection of the "best" hit (-max_hsps 1). Apparently, this split-hit behaviour was not a one-off, but almost universal across all runs of Gretel over the 640 sets of reads.

In fact, the split-hit was highly precise. Almost every best XM_012113510 haplotype had a hit along the first 87bp, and another starting at (or very close to) 135bp, stretching across the remainder of the gene. My curiosity piqued, I fell down another rabbit hole.

Curious Coverage

I immediately assumed this must be an interesting side-effect of read dropping. I indicated in our pre-print that we had difficulty aligning reads against a pseudo-reference where those reads originate from a sequence that is dissimilar (<= 90% identity) to the reference. This read dropping phenomenon has been one of the many issues our lab has experienced with current bioinformatics tooling when dealing with metagenomic data sets.

Indeed, our boxplot above features a colour-coded scatter plot that demonstrates that the two most dissimilar input haplotypes, AK and XM, are also the haplotypes that experience the most trouble during read alignment. I suggested in the pre-print that these dropped reads are the underlying reason for Gretel's performance on those more dissimilar haplotypes.

I wanted to see empirically whether this was the case, and to potentially find a nice way of providing evidence, given that one of our reviewers missed my attribution of these poorer AK and XM recoveries to trouble with our alignment step, rather than incompetence on Gretel's part. I knocked up a small Python script that deciphered my automatically generated read names and calculated per-haplotype read coverage, for each of the five input haplotypes, across the 640 alignments.
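
The guts of that script look something like the following, a simplified sketch using pysam. The read-name convention shown (read names beginning with the source haplotype's accession, separated by an underscore) is an assumption for the example rather than my exact naming scheme.

from collections import defaultdict

import pysam

def per_haplotype_coverage(bam_path, reference_length):
    # Tally read depth per source haplotype at each position of the pseudo-reference
    coverage = defaultdict(lambda: [0] * reference_length)
    bam = pysam.AlignmentFile(bam_path, "rb")
    for read in bam:
        if read.is_unmapped:
            continue
        haplotype = read.query_name.split("_")[0]  # assumed format: <accession>_<read number>
        for pos in read.get_reference_positions():
            coverage[haplotype][pos] += 1
    bam.close()
    return coverage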

  • X-axis represents genomic position along the pseudo-reference (the "master" DHFR gene)
  • Y-axis labels each of the input haplotype, starting with the most dissimilar (XM) and ending on the most similar (BC)
  • The coloured heatmap plots the average coverage for a given haplotype at a given position, across all generated BAM files

Well, look what we have here… Patches of XM and AK are most definitely poorly supported by reads! This is troubling for their recovery, as Hansel stores pairwise evidence of variants co-occurring on the same read. If those reads are discarded, we have no evidence of that haplotype. It should be no surprise that Gretel has difficulty here. We can't create evidence from nothing.

What's this?… A crisp, ungradiated, dark navy box representing absolutely 0 reads across any of the 640 alignments, sitting on our XM gene. If I had to guesstimate the bounds of that box on the X-axis, I would bet my stipend that they'll be a stone's throw from 87 and 135bp… the bounds of the split BLAST hits we were reliably generating against XM_012113510, almost 640 times. I delved deeper. What did the XM reads over that region look like? Why didn't they align correctly? Were they misaligned, or dropped entirely?

But I couldn’t find any.

Having a blast with BLAST

After much head scratching, I began an audit from the beginning, consulting the BLAST record that led me to select XM_012113510.1 for our study:

EU145592.1   XM_012113510.1   82.979   564   46   1   1   564   52   565   3.89e-175   625

Seems legit? An approximately 83% hit with full query coverage (564bp), consisting of 46 mismatches and containing an indel of length one? Not quite, as I would later find. But to cut an already long story somewhat shorter: the problem boils down to me being unable to read, again.

  • BLAST outfmt6 alignment lengths include deletions on the subject
  • BLAST outfmt6 tells you the number of gaps, not their size

For fuck’s sake, Sam. Sure enough, here’s the full alignment record. Complete with a fucking gigantic 50bp deletion on the subject, XM_012113510.

Query  1    ATGGTTGGTTCGCTAAACTGCATCGTCGCTGTGTCCCAGAACATGGGCATCGGCAAGAAC  60
            |||||| || ||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct  52   ATGGTTCGTCCGCTAAACTGCATCGTCGCTGTGTCCCAGAACATGGGCATCGGCAAGAAC  111

Query  61   GGGGACCTGCCCTGGCCACCGCTCAGGAATGAATTCAGATATTTCCAGAGAATGACCACA  120
            ||| |||||||||||||||| |||||                                  
Sbjct  112  GGGAACCTGCCCTGGCCACCACTCAG----------------------------------  137

Query  121  ACCTCTTCAGTAGAAGGTAAACAGAATCTGGTGATTATGGGTAAGAAGACCTGGTTCTCC  180
                            ||||||||||| ||||||||||||||| ||||||||||||||||
Sbjct  138  ----------------GTAAACAGAATTTGGTGATTATGGGTAGGAAGACCTGGTTCTCC  181

Query  181  ATTCCTGAGAAGAATCGACCTTTAAAGGGTAGAATTAATTTAGTTCTCAGCAGAGAACTC  240
            ||||| ||||||||||||||||||||||  ||||||||| |||||||||| |||||||||
Sbjct  182  ATTCCAGAGAAGAATCGACCTTTAAAGGACAGAATTAATATAGTTCTCAGTAGAGAACTC  241

Query  241  AAGGAACCTCCACAAGGAGCTCATTTTCTTTCCAGAAGTCTAGATGATGCCTTAAAACTT  300
            |||||||||||| | ||||||||||||||| ||| |||||| |||||||||||| |||||
Sbjct  242  AAGGAACCTCCAAAGGGAGCTCATTTTCTTGCCAAAAGTCTGGATGATGCCTTAGAACTT  301

Query  301  ACTGAACAACCAGAATTAGCAAATAAAGTAGACATGGTCTGGATAGTTGGTGGCAGTTCT  360
            | |||| | ||||||||| ||||||||||||||||||| |||||||| || |||||||||
Sbjct  302  ATTGAAGATCCAGAATTAACAAATAAAGTAGACATGGTTTGGATAGTGGGAGGCAGTTCT  361

Query  361  GTTTATAAGGAAGCCATGAATCACCCAGGCCATCTTAAACTATTTGTGACAAGGATCATG  420
            || |||||||||||||||||  | ||||||||||||| ||||||||||||||||||||||
Sbjct  362  GTATATAAGGAAGCCATGAACAAGCCAGGCCATCTTAGACTATTTGTGACAAGGATCATG  421

Query  421  CAAGACTTTGAAAGTGACACGTTTTTTCCAGAAATTGATTTGGAGAAATATAAACTTCTG  480
            ||||| |||||||||||   |||||| |||||||||||||| || |||||||||||||| 
Sbjct  422  CAAGAATTTGAAAGTGATGTGTTTTTCCCAGAAATTGATTTTGAAAAATATAAACTTCTT  481

Query  481  CCAGAATACCCAGGTGTTCTCTCTGATGTCCAGGAGGAGAAAGGCATTAAGTACAAATTT  540
            |||||||| ||||||||||  |  |||||||||||||| |||||||||||||||||||||
Sbjct  482  CCAGAATATCCAGGTGTTCCTTTGGATGTCCAGGAGGAAAAAGGCATTAAGTACAAATTT  541

Query  541  GAAGTATATGAGAAGAATGATTAA  564
            ||||||||||| |||||  |||||
Sbjct  542  GAAGTATATGAAAAGAACAATTAA  565

It is no wonder we couldn't recover 100% of the XM_012113510 haplotype: when compared to our pseudo-reference, 10% of it doesn't fucking exist. Yet, it is interesting to see that the best recovered XM_012113510's were identified by BLAST as very good hits to the XM_012113510 that actually exists in nature, despite my spurious 50bp insertion. Although Gretel is still biased by the structure of the pseudo-reference (which is one of the reasons that insertions and deletions are still a pain), we are still able to make accurate recoveries around straightforward indel sites like this.

As lovely as it is to have gotten to the bottom of the XM_012113510 recovery troubles, this doesn’t really help our manuscript. We want to avoid complicated discussions and workarounds that may confuse or bore the reader. We already had trouble with our reviewers misunderstanding the purpose of the convoluted Triviome method and its evaluation. I don’t want to have to explain why recovery rates for XM_012113510 need special consideration because of this large indel.

I decided the best option was to find a new member of the band.

DHFR: Reunion Tour

As is the case whenever I am certain something won’t take the entire day, this was not as easy as I anticipated. Whilst it was trivial to BLAST my chosen pseudo-reference once more and find a suitable replacement (reading beyond “query cover” alone this time), trouble arose once more in the form of reads dropped via alignment.

Determined that these sequences should (and must) align, I dedicated the day to finally find some bowtie2 parameters that would more permissively align sequences to my pseudo-reference to ensure Gretel had the necessary evidence to recover even the more dissimilar sequences:

Success.

So, what’s next?

Wonderful. So after your productive day of heatmap generation, what are you going to do with your new-found alignment powers?

✓ Select a new DHFR line-up

The door has been blown wide open here. Previously, our input test haplotypes had a lower bound of around 90% sequence similarity, to avoid the loss of too many reads. However, with the parameter set that I have termed --super-sensitive-local, we can attempt recoveries of genes that are even more dissimilar from the reference! I've actually already made the selection, electing to keep BC (99.8%), XR (97.3%) and AK (90.2%). I've removed KJ (93.2%) and the troublemaking XM (85.1%) to make way for the excitingly dissimilar M (83.5%) and (a different) XM (78.7%).

✓ Make a pretty picture

After some chitin and Sun Grid Engine wrangling, I present this:

✓ Stare in disbelief

Holy crap. It works.


Conclusion

  • Find what is inevitably wrong with these surprisingly excellent results
  • Get this fucking paper done
  • Merry Christmas

tl;dr

  • In a panic I went back to check my DHFR analysis and then despite it being OK, changed everything anyway
  • One of the genes from our paper is quite hard to recover, because 10% of it is fucking missing
  • Indels continue to be a pain in my arse
  • Contrary to what I said last week, not everything is awful
  • Coverage really is integral to Gretel‘s ability to make good recoveries
  • I’ve constructed some parameters that allow us to permissively align metagenomic reads to our pseudo-reference and shown it to massively improve Gretel‘s accuracy
  • bowtie2 needs some love <3
  • I probably care far too much about integrity of data and results and should just write my paper already
  • A lot of my problems boil down to not being able to read
  • I owe Hadley Wickham a drink for making it so easy to make my graphs look so pretty
  • Everyone agrees this should be my last Christmas as a PhD
  • Tom is mad because I keep telling him about bugs and pitfalls I have found in bioinformatics tools
bowtie2: Relaxed Parameters for Generous Alignments to Metagenomes
Sat, 24 Dec 2016 | https://samnicholls.net/2016/12/24/bowtie2-metagenomes/

In a change to my usual essay-length posts, I wanted to share a quick bowtie2 tip for relaxing the parameters of alignment. It's no big secret that bowtie2 has these options, and there's some pretty good guidance in the manual, too. However, we've had significant trouble in our lab finding a suitable set of permissive alignment parameters.

In the course of my PhD work on haplotyping regions of metagenomes, I have found that, even using bowtie2's somewhat permissive --very-sensitive-local, sequences with an identity to the reference of less than 90% are significantly less likely to align back to that reference. This is problematic in my line of work, where I wish to recover all of the individual variants of a gene, as the basis of my approach relies on a set of short reads (50-250bp) aligned to a position on a metagenomic assembly (that I term the pseudo-reference). It's important to note that I am not interested in the assembly of individual genomes from metagenomic reads, but in the genes themselves.

Recently, the opportunity arose to provide some evidence for this. I have some data sets which constitute "synthetic metahaplomes", consisting of a handful of arbitrary known genes that all perform the same function, each from a different organism. These genes can be broken up into synthetic reads and aligned to some common reference (another gene in the same family).

This alignment can be used as a means to test my metagenomic haplotyper, Gretel (and her novel brother data structure, Hansel), by attempting to recover the original input sequences from these synthetic reads. I've already reported in my pre-print that our method is at the mercy of the preceding alignment, and used this as the hypothesis for a poor recovery in one of our data sets.

Indeed, as part of my latest experiments, I have generated some coverage heat maps, showing the average coverage of each haplotype (Y-axis) at each position of the pseudo-reference (X-axis), and I've found that for sequences beyond the vicinity of 90% sequence identity, --very-sensitive-local becomes unsuitable.

The BLAST record below represents the alignment that corresponds to the gene whose reads go on to align at the average coverage depicted at the top bar of the above heatmap. Despite its 79% identity, it looks good(TM) to me, and I need sequence of this level of diversity to align to my pseudo-reference so it can be included in Gretel‘s analysis. I need generous alignment parameters to permit even quite diverse reads (but hopefully not too diverse such that it is no longer a gene of the same family) to map back to my reference. Otherwise Gretel will simply miss these haplotypes.

So despite having already spent many days of my PhD repeatedly failing to increase my overall alignment rates for my metagenomes, I felt this time it would be different. I had a method (my heatmap) to see how my alignment parameters affected the alignment rates of reads on a per-haplotype basis. It’s also taken until now for me to quantify just what sort of sequences we are missing out on, courtesy of dropped reads.

I was determined to get this right.

For a change, I’ll save you the anticipation and tell you what I settled on after about 36 hours of getting cross.

  • --local -D 20 -R 3
    Ensure we’re not performing end-to-end alignment (allow for soft clipping and the like), and borrow the most sensitive default “effort” parameters.
  • -L 3
    The seed substring length. Decreasing this from the default (20 - 25) to just 3 allows for a much more aggressive alignment, but adds computational cost. I actually had reasonably good results with -L 11, which might suit you if you have a much larger data set but still need to relax the aligner.
  • -N 1
    Permit a mismatch in the seed, because why not?
  • --gbar 1
    Has a small, but noticeable effect. Appears to thin the width of some of the coverage gap in the heatmap at the most stubborn sites.
  • --mp 4
    Reduces the maximum penalty that can be applied to a strongly supported (high quality) mismatch by a third (from the default value of 6). The aggregate sum of these penalties are responsible for the dropping of reads. Along with the substring length, this had a significant influence on increasing my alignment rates. If your coverage stains are stubborn, you could decrease this again.

Tada.


tl;dr

  • bowtie2 --local -D 20 -R 3 -L 3 -N 1 -p 8 --gbar 1 --mp 3
Status Report: November 2016 (Part I): Triviomes, Treeviomes & Fuck Everything
Mon, 19 Dec 2016 | https://samnicholls.net/2016/12/19/status-nov16-p1/

Once again, I have adequately confounded progress since my last report to both myself and my supervisorial team, such that it must be outlaid here. Since I got back from having a lovely time away from bioinformatics, the focus has been to build on top of our highly shared but unfortunately rejected pre-print: Advances in the recovery of haplotypes from the metagenome.

I'd hoped to have a new-and-improved draft ready by Christmas, in time for an invited talk at Oxford, but sadly I've had to postpone both. Admittedly, it has taken quite some time for me to dust myself down after having the entire premise of my PhD so far rejected without re-submission, but I have finally built up the motivation to revisit what is quite a mammoth piece of work, and am hopeful that I can take some of the feedback on board to ring in the new year with an even better paper.

This will likely be the final update of the year.
This is also the last Christmas I hope to be a PhD candidate.

Friends and family can skip to the tldr

The adventure continues…

We left off with a lengthy introduction to my novel data structure, Hansel, and algorithm, Gretel. In that post I briefly described some of the core concepts of my approach, such as how the Hansel matrix is reweighted after Gretel successfully creates a path (haplotype), how we automatically select a suitable value for the "lookback" parameter (i.e. the order of the Markov chain used when calculating probabilities for the next variant of a haplotype), and the current strategy for smoothing.

In particular, I described our current testing methodologies. In the absence of metagenomic data sets with known haplotypes, I improvised two strategies:

  • Trivial Haplomes (Triviomes)
    Data sets designed to be finely controlled, and well-defined. Short, random haplotypes and sets of reads are generated. We also generate the alignment and variant calls automatically to eliminate noise arising from the biases of external tools. These data sets are not expected to be indicative of performance on actual sequence data, but rather represent a platform on which we can test some of the limitations of the approach.

  • Synthetic Metahaplomes
    Designed to be more representative of the problem, we generate synthetic reads from a set of similar genes. The goal is to recover the known input genes, from an alignment of their reads against a pseudo-reference.

I felt our reviewers misunderstood both the purpose and results of the “triviomes”. In retrospect, this was probably due to the (albeit intentional) lack of any biological grounding distracting readers from the story at hand. The trivial haplotypes were randomly generated, such that none of them had any shared phylogeny. Every position across those haplotypes was deemed a SNP, and was often tetra-allelic. The idea behind this was to cut out the intermediate stage of needing to remove homogeneous positions across the haplotypes (or in fact, from even having to generate haplotypes that had homogeneous positions). Generated reads were thus seemingly unrealistic, at a length of 3-5bp. However they were meant to represent not a 3-5bp piece of sequence, but the 3-5bp sequence that remains when one only considers genomic positions with variation, i.e. our reads were simulated such that they spanned between 3 and 5 SNPs of our generated haplotypes.

I believe these confusing properties and their justifications got in the way of expressing their purpose, which was not to emulate the real metahaplotying problem, but to introduce some of the concepts and limitations of our approach in a controlled environment.

Additionally, our reviewers argued that the paper is lacking an extension to the evaluation of synthetic metahaplomes: data sets that contain real sequencing reads. Indeed, I felt that this was probably the largest weakness of my own paper, especially as it would not require an annotated metagenome. Though, I had purposefully stayed on the periphery of simulating a “proper” metagenome, as there are ongoing arguments in the literature as to the correct methodology and I wanted to avoid the simulation itself being used against our work. That said, it would be prudent to at least present small synthetic metahaplomes akin to the DHFR and AIMP1, using real reads.

So this leaves us with a few major plot points to work on before I can peddle the paper elsewhere:

  • Improve Triviomes
    We are already doing something interesting and novel, but the “triviomes” are evidently convoluting the explanation. We need something with more biological grounding such that we don’t need to spend many paragraphs explaining why we’ve made certain simplifications, or cause readers to question why we are doing things in a particular way. Note this new method will still need to give us a controlled environment to test the limitations of Hansel and Gretel.
  • Polish DHFR and AIMP1 analysis
    One of our reviewers misinterpreted some of the results, and drew a negative conclusion about Gretel‘s overall accuracy. I’d like to revisit the DHFR and AIMP1 data sets both to improve the story we tell, and to describe in more detail (with more experiments) under what conditions we can and cannot recover haplotypes accurately.
  • Real Reads
    Create and analyse a data set consisting of real reads.

The remainder of this post will focus on the first point, because otherwise no-one will read it.


Triviomes and Treeviomes

After a discussion about how my Triviomes did not pay off (where I believe I likened them to “random garbage”), it was clear that we needed a different tactic to introduce this work. Ideally this would be something simple enough that we could still have total control over both the metahaplome to be recovered, and the reads to recover it from, but also yield a simpler explanation for our readers.

My biology-sided supervisor, Chris, is an evolutionary biologist with a fetish for trees. Throughout my PhD so far, I have managed to steer away from phylogenetic trees and the like, especially after my terrifying first year foray into taxonomy, where I discovered that not only can nobody agree on what anything is, or where it should go, but there are many ways to skin a cat draw a tree.

Previously, I presented the aggregated recovery rates of randomly generated metahaplomes, for a series of experiments, where I varied the number of haplotypes, and their length. Remember that every position of these generated haplotypes was a variant. Thus, one may argue that the length of these random haplotypes was a poor proxy for genetic diversity. That is, we increased the number of variants (by making longer haplotypes) to artificially increase the level of diversity in the random metahaplome, and make recoveries more difficult. Chris pointed out that actually, we could specify and fix the level of diversity, and generate our haplotypes according to some… tree.

This seemed like an annoyingly neat and tidy solution to my problem. Biologically speaking, this is a much easier explanation to readers; our sequences will have meaning, our reads will look somewhat more realistic and most importantly, the recovery goal is all the more tangible. Yet at the same time, we still have precise control over the tree, and we can generate the synthetic reads in exactly the same way as before, allowing us to maintain tight control of their attributes. So, despite my aversion to anything that remotely resembles a dendrogram, on this occasion, I have yielded. I introduce the evaluation strategy to supplant1 my Triviomes: Treeviomes.

(Brief) Methodology

  • Heartlessly throw the Triviomes section in the bin
  • Generate a random start DNA sequence
  • Generate a Newick format tree. The tree is a representation of the metahaplome that we will attempt to recover. Each branch (taxon) of the tree corresponds to a haplotype. The shape of the tree will be a star, with each branch of uniform length. Thus, the tree depicts a number of equally diverse taxa from a shared origin
  • Use the tree to simulate evolution of the start DNA sequence to create the haplotypes that comprise the synthetic metahaplome
  • As before, generate reads (of a given length, at some level of coverage) from each haplotype, and automatically generate the alignment (we know where our generated reads should start and end on the reference without external tools) and variant calls (any heterogeneous genomic position when the reads are piled up)
  • Rinse and repeat, make pretty pictures

The foundation for this part of the work is set. Chris even recommended seq-gen as a tool that can simulate evolution from a starting DNA sequence, following a Newick tree, which I am using to generate our haplotypes. So I now have a push-buttan-to-metahaplome workflow that generates the necessary tree, haplotypes, and reads for testing Gretel.
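For the curious, the tree-and-sequence generation step really is as mundane as it sounds. Here is a minimal sketch of the idea (the file names, taxa labels and the seq-gen invocation in the comments are placeholders of my own for illustration, not the actual harness):

# Sketch: a random start sequence and a star-shaped Newick tree with N
# equally-diverse taxa. The branch length is the per-haplotype, per-base
# mutation rate (e.g. 0.015). File names here are placeholders.
import random

def random_sequence(length):
    """Generate a random start DNA sequence."""
    return "".join(random.choice("ACGT") for _ in range(length))

def star_newick(n_taxa, branch_length):
    """A star topology: every taxon hangs off the root at the same depth,
    so each haplotype is equally diverged from the common origin."""
    taxa = ",".join("hap%s:%.3f" % (chr(65 + i), branch_length)
                    for i in range(n_taxa))
    return "(%s);" % taxa

if __name__ == "__main__":
    with open("start.fa", "w") as fh:
        fh.write(">start\n%s\n" % random_sequence(3000))
    with open("star.nwk", "w") as fh:
        fh.write(star_newick(5, 0.015) + "\n")
    # The haplotypes are then simulated along the tree with something like
    # seq-gen (exact flags depend on your version and chosen model), e.g.
    #   seq-gen -mHKY -l 3000 < star.nwk > haplotypes.phy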

I’ve had two main difficulties with Treeviomes…

• Throughput

Once again, running anything thousands of times has proven the bane of my life. Despite having a well defined workflow to generate and test a metahaplome, getting the various tools and scripts to work on the cluster here has been a complete pain in my arse. So much so, I ended up generating all of the data on my laptop (sequentially, over the course of a few days) and merely uploading the final BAMs and VCFs to our compute cluster to run Gretel. This has been pretty frustrating, especially when last weekend I set my laptop to work on creating a few thousand synthetic metahaplomes and promised some friends that I’d take the weekend off work for a change, only to find on Monday that my laptop had done exactly the same.

• Analysis

Rather unexpectedly, initial results raised more questions than answers. This was pretty unwelcome news following the faff involved in just generating and testing the many metahaplomes. Once Gretel‘s recoveries were finished (the smoothest part of the operation, which was a surprise in itself, given the presence of Sun Grid Engine), another disgusting munging script of my own doing spat out the convoluted plot below:

The figure is a matrix of boxplots where:

  • Horizontal facets are the number of taxa in the tree (i.e. haplotypes)
  • Vertical facets are per-haplotype, per-base mutation rates (i.e. the probability that any genomic position on any of the taxa may be mutated from the common origin sequence)
  • X-axis of each boxplot represents each haplotype in the metahaplome, labelled A – O
  • Y-axis of each boxplot quantifies the average best recovery rate made by Gretel for a given haplotype A – O, over ten executions of Gretel (each using a different randomly generated, uniformly distributed read set of 150bp at 7x per-haplotype coverage)

We could make a few wild speculations, but no concrete conclusions:

  • At low diversity, it may be impossible to recover haplotypes, especially for metahaplomes containing fewer haplotypes
  • Increasing diversity appears to create more variance in accuracy; mean accuracy increases slightly in datasets with 3-5 haplotypes, but falls in those with 10+
  • Increasing the number of haplotypes in the metahaplome appears to increase recovery accuracy
  • In general, whilst there is variation, recovery rates across haplotypes are fairly clustered
  • It is possible to achieve 100% accuracy for some haplotypes under high diversity, and few true haplotypes

The data is not substantial on the surface. But, if anything, I seemed to have refuted my own pre-print. Counter-intuitively, we now seem to have shown that the problem is easier in the presence of more haplotypes, and more variation. I was particularly disappointed with the ~80% accuracy rates for mid-level diversity on just 3 haplotypes. Overall, recovery accuracy appeared worse than that of my less realistic Triviomes.

This made me sad, but mostly cross.

The beginning of the end of my sanity

I despaired at the apparent loss of accuracy. Where had my over 90% recoveries gone? I could feel my PhD pouring away through my fingers like sand. What changed here? Indeed, I had altered the way I generated reads since the pre-print; was it the new read shredder? Or are we just less good at recovering from more realistic metahaplomes? With the astute assumption that everything I am working on equates to garbage, I decided to miserably withdraw from my PhD for a few days to play Eve Online…

I enjoyed my experiences of space. I began to wonder whether I should quit my PhD and become an astronaut, shortly before my multi-million ISK ship was obliterated by pirates. I lamented my inability to enjoy games that lack copious micromanagement, before accepting that I am destined to be grumpy in all universes and that perhaps for now I should be grumpy in the one where I have a PhD to write.

In retrospect, I figure that perhaps the results in my pre-print and the ones in our new megaboxplot were not in disagreement, but rather incomparable in the first place. Whilst an inconclusive conclusion on that front would not answer any of the other questions introduced by the boxplots, it would at least make me feel a bit better.

Scattering recovery rates by variant count

So I constructed a scatter plot to show the relationship between the number of called variants (i.e. SNPs), and best Gretel recovery rate for each haplotype of all of the tested metahaplomes (dots coloured by coverage level below), against the overall best average recovery rates from my pre-print (large black dots).

Immediately, it is obvious that we are discussing a difference in magnitude when it comes to numbers of called variants, particularly when base mutation rates are high. But if we are still looking for excuses, we can consider the additional caveats:

  • Read coverage from the paper is 3-5x per haplotype, whereas our new data set uses a fixed coverage of 7x
  • The number of variants on the original data sets (black dots) are guaranteed, and bounded, by their length (250bp max)
  • Haplotypes from the paper were generated randomly, with equal probabilities for nucleotide selection. We can consider this as a 3 in 4 chance of disagreeing with the pseudo-reference (a 0.75 base mutation rate). The most equivalent subset of our new data consists of metahaplomes with a base mutation rate of “just” 0.25.

Perhaps the most pertinent point here is the last. Without an insane 0.75 mutation rate dataset, it really is quite sketchy to debate how recovery rates of these two data sets should be compared. This said, from the graph we can see:

  • Those 90+% average recoveries I’m missing so badly belong to a very small subset of the original data, with very few SNPs (10-25)
  • There are still recovery rates stretching toward 100%, particularly for the 3 haplotype data set, but only for base mutation rates of 2.5% and above
  • Actually, recovery rates are not so sad overall, considering the significant number of SNPs, particularly for the 5 and 10 haplotype metahaplomes

Recoveries are high for unrealistic variation

Given that a variation rate of 0.75 is incomparable, what’s a sensible amount of variation to concern ourselves with anyway? I ran the numbers on my DHFR and AIMP1 data sets; dividing the number of called variants on my contigs by their total length. Naively distributing the number of SNPs across each haplotype evenly, I found the magic number representing per-haplotype, per-base variation to be around 1.5% (0.015). Of course, that isn’t exactly a rigorous analysis, but perhaps points us in the right direction, if not the correct order of magnitude.
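For what it’s worth, the back-of-the-envelope sum is nothing more than the following (the numbers below are made up for illustration; the real DHFR and AIMP1 counts differ):

# Naive estimate of per-haplotype, per-base variation from a real data set.
# Hypothetical numbers for illustration only.
called_variants = 225      # SNPs called across the contig pileup
contig_length = 3000       # total contig length (bp)
n_haplotypes = 5           # haplotypes assumed to share the variants evenly

per_base_variation = (called_variants / float(contig_length)) / n_haplotypes
print(per_base_variation)  # -> 0.015, i.e. 1.5% per-haplotype, per-base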

So the jig is up? We report high recovery rates for unnecessarily high variation rates (>2.5%), but our current data sets don’t seem to support the idea that Gretel needs to be capable of recovering from metahaplomes demonstrating that much variation. This is bad news, as conversely, both our megaboxplot and scatter plot show that for rates of 0.5%, Gretel recoveries were not possible in either of the 3 or 5 taxa metahaplomes. Additionally at a level of 1% (0.01), recovery success was mixed in our 3 taxa datasets. Even at the magic 1.5%, for both the 3 and 5 taxa, average recoveries sit uninterestingly between 75% and 87.5%.

Confounding variables are the true source of misery

Even with the feeling that my PhD is going through rapid unplanned disassembly with me still inside of it, I cannot shake off the curious result that increasing the number of taxa in the tree appears to improve recovery accuracy. Each faceted column of the megaboxplot shares elements of the same tree. That is, the 3 taxa 0.01 (or 1%) diversity rate tree is a subtree of the 15 taxa 0.01 diversity tree. The haplotypes A, B and C are shared. Yet why does the only reliable way to improve results among those haplotypes seem to be the addition of more haplotypes? In fact, why are the recovery rates of all the 10+ metahaplomes so good, even under per-base variation of half a percent?

We’ve found the trap door, and it is confounding.

Look again at the pretty scatter plot. Notice how the number of called variants increases as we increase the number of haplotypes, for the same level of variation. Notice also that it becomes possible to actually recover the same A, B and C haplotypes found in the 3-taxa trees at low diversity, once there are 10 or 15 taxa present.

Recall that each branch of our tree is weighted by the same diversity rate. Thus, when aligned to a pseudo-reference, synthetic reads generated from metahaplomes with more original haplotypes have a much higher per-position probability for containing at least one disagreeing nucleotide in a pileup. i.e. The number of variants is a function of the number of original haplotypes, not just their diversity.
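To make that concrete, here is a rough sketch of the expected variant density (it ignores the possibility of taxa independently mutating a position to the same base, so it slightly overstates things, but it is close enough for small rates):

# Approximate probability that a pileup position is heterogeneous: at least
# one of the N haplotypes disagrees with the common origin there. A rough
# approximation that ignores coincident identical mutations.
def expected_variant_density(n_taxa, per_base_mutation_rate):
    return 1.0 - (1.0 - per_base_mutation_rate) ** n_taxa

for n in (3, 5, 10, 15):
    density = expected_variant_density(n, 0.01)   # 1% per-haplotype diversity
    # expected SNP count on a 3000bp pseudo-reference
    print(n, round(density, 3), int(density * 3000))

At the same 1% per-haplotype diversity, the 15-taxa tree is expected to throw up several times as many variant positions as the 3-taxa tree.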

The confounding factor is the influence of Gretel‘s lookback parameter: L. We automatically set the order of the Markov chain used to determine the next nucleotide variant given the last L selected variants, to be equal to the average number of SNPs spanned by all valid reads that populated the Hansel structure. A higher number of called variants in a dataset offers not only more pairwise evidence for Hansel and Gretel to consider (as there are more pairs of SNPs), but also a higher order Markov chain (as there are more pairs of SNPs on the same read). Thus, with more SNPs, the hypothesis is that the sequences of L variants at Gretel‘s disposal are not only longer, but more unique to the haplotype that must be recovered.

It seems my counter-intuitive result of more variants and more haplotypes making the problem easier, has the potential to be true.

This theory explains the converse problem of being unable to recover any haplotypes from 3 and 5-taxa trees at low diversity. There simply aren’t enough variants to inform Gretel. After all, at a rate of 0.5%, one would expect a mere 5 variants per 1000bp. Our scatter plot shows that for our 3000bp pseudo-reference, at the 0.5% level we observe fewer than 50 SNPs in total across the haplotypes of our 3-taxa tree. Our 150bp reads are not long enough to span the gaps between variants, and Gretel cannot make decisions on how to cross these gaps.
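Putting rough numbers on that last point (same approximation as the sketch above, and again only a sketch):

# For a 3-taxa tree: expected SNPs spanned by a single 150bp read, and the
# chance that the next SNP along is further away than a read can bridge.
def variant_density(n_taxa, per_base_mutation_rate):
    return 1.0 - (1.0 - per_base_mutation_rate) ** n_taxa

READ_LENGTH = 150
for rate in (0.005, 0.010, 0.015):
    d = variant_density(3, rate)
    snps_per_read = READ_LENGTH * d            # expected SNPs on one read
    p_unbridgeable = (1.0 - d) ** READ_LENGTH  # P(adjacent SNPs > 150bp apart)
    print(rate, round(snps_per_read, 1), round(p_unbridgeable, 3))

At the 0.5% level, roughly one in ten of the gaps between adjacent SNPs turns out to be longer than a read, so over a few dozen SNPs Gretel is almost guaranteed to hit at least one gap she cannot cross.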

This doesn’t necessarily mean everything is not terrible, but it certainly means the megaboxplot is not only an awful way to demonstrate our results, but probably a poorly designed experiment too. We currently confound the average number of SNPs on reads by observing just the number of haplotypes, and their diversity. To add insult to statistical injury, we then plot them in facets that imply they can be fairly compared. Yet increasing the number of haplotypes, increases the number of variants, which increases the density of SNPs on reads, and improves Gretel‘s performance: we cannot compare the 3-taxa and 15-taxa trees of the same diversity in this way as the 15-taxa tree has an unfair advantage.

I debated with my resident PhD tree pervert about this. In particular, I suggested that perhaps the diversity could be equally split between the branches, such that synthetic read sets from a 3-taxa tree and 15-taxa tree should expect to have the same number of called variants, even if the individual haplotypes themselves have a different level of variation between the trees. Chris argued that whilst that would fix the problem and make the trees more comparable, going against the grain of simple biological explanations would reintroduce the boilerplate explanation bloat to the paper that we were trying to avoid in the first place.

Around this time I decided to say fuck everything, gave up and wrote a shell for a little while.

Deconfounding the megabox

So where are we now? Firstly, I agreed with Chris. I think splitting the diversity between haplotypes, whilst yielding datasets that might be more readily comparable, will just make for more difficult explanations in our paper. But fundamentally, I don’t think these comparisons actually help us to tell the story of Hansel and Gretel. Thinking about it afterwards, there are other nasty, unobserved variables in our megaboxplot experiment that directly affect the density of variants on reads, namely read length and read coverage. We had fixed these to 150bp and 7x coverage for the purpose of our analysis, which felt like a dirty trick.

At this point, bioinformatics was starting to feel like a grand conspiracy, and I was in on it. Would it even be possible to fairly test and describe how our algorithm works through the noise of all of these confounding factors?

I envisaged the most honest method to describe the efficacy of my approach, as a sort of lookup table. I want our prospective users to be able to determine what sort of haplotype recovery rates might be possible from their metagenome, given a few known attributes, such as read length and coverage, at their region of interest. I also feel obligated to show under what circumstances Gretel performs less well, and offer reasoning for why. But ultimately, I want readers to know that this stuff is really fucking hard.

Introducing the all new low-fat less-garbage megaboxplot

Here is where I am right now. I took this lookup idea, and ran a new experiment consisting of some 1500 sets of reads, and runs of Gretel, and threw the results together to make this:

  • Horizontal facets represent synthetic read length
  • Vertical facets are (again) per-haplotype, per-base mutation rates, this time expressed as a percentage (so a rate of 0.01, is now 1%)
  • Colour coded X-axis of each boxplot depicts the average per-haplotype read coverage
  • Y-axis of each boxplot quantifies the average best recovery rate made by Gretel for all of the five haplotypes, over ten executions of Gretel (each using a different randomly generated, uniformly distributed read set)

I feel this graph is much more tangible to users and readers. I feel much more comfortable expressing our recovery rates in this format, and I hope eventually our reviewers and real users will agree. Immediately we can see this figure reinforces some expectations, primarily that increasing the read length and/or coverage yields a large improvement in Gretel‘s performance. Increasing read length also lowers the coverage required for accuracy.

This seems like a reasonable proof of concept, so what’s next?

  • Generate a significantly larger amount of input data, preferably in a way that doesn’t make me feel ill or depressed
  • Battle with the cluster to execute more experiments
  • Generate many more pretty graphs

I’d like to run this test for metahaplomes with a different number of taxa, just to satisfy my curiosity. I also want to investigate the 1 – 2% diversity region in a more fine-grained fashion. Particularly important will be to repeat the experiments with multiple metahaplomes for each read length, coverage and sequence diversity parameter triplet, to randomise away the influence of the tree itself. I’m confident this is the reason for inconsistencies in the latest plot, such as the 1.5% diversity tree with 100bp reads yielding no results (likely due to this particular tree generating haplotypes such that piled up reads contain a pair of variants more than 100bp apart).


Conclusion

  • Generate more fucking metahaplomes
  • Get this fucking paper out

tl;dr

  • I don’t want to be doing this PhD thing in a year’s time
  • I’ve finally started looking again at our glorious rejected pre-print
  • The Trivial haplomes tanked, they were too hard to explain to reviewers and actually don’t provide that much context on Gretel anyway
  • New tree-based datasets have superseded the triviomes2
  • Phylogenetics maybe isn’t so bad (but I’m still not sure)
  • Once again, the cluster and parallelism in general has proven to be the bane of my fucking life
  • It can be quite difficult to present results in a sensible and meaningful fashion
  • There are so many confounding factors in analysis and I feel obligated to control for them all because it feels like bad science otherwise
  • I’m fucking losing it lately
  • Playing spaceships in space is great but don’t expect to not be blown out of fucking orbit just because you are trying to have a nice time
  • I really love ggplot2, even if the rest of R is garbage
  • I’ve been testing Gretel at “silly” levels of variation thinking that this gives proof that we are good at really hard problems, but actually more variation seems to make the problem of recovery easier
  • 1.5% per-haplotype per-base mutation seems to be my current magic number (n=2, because fuck you)
  • I wrote a shell because keeping track of all of this has been an unmitigated clusterfuck
  • I now have some plots that make me feel less like I want to jump off something tall
  • I only seem to enjoy video games that have plenty of micromanagement that stress me out more than my PhD
  • I think Bioinformatics PhD Simulator 2018 would make a great game
  • Unrealistic testing cannot give realistic answers
  • My supervisor, Chris is a massive dendrophile3
  • HR bullshit makes a grumpy PhD student much more grumpy
  • This stuff, is really fucking hard

  1. supplant HAH GET IT 
  2. superseeded HAHAH I AM ON FIRE 
  3. phylogenphile? 
Bioinformatics is a disorganised disaster and I am too. So I made a shell.
https://samnicholls.net/2016/11/16/disorganised-disaster/
Wed, 16 Nov 2016 17:50:59 +0000

If you don’t want to hear me wax lyrical about how disorganised I am, you can skip ahead to where I tell you about how great the pseudo-shell that I made and named chitin is.

Back in 2014, about half way through my undergraduate dissertation (Application of Machine Learning Techniques to Next Generation Sequencing Quality Control), I made an unsettling discovery.

I am disorganised.

The discovery was made after my supervisor asked a few interesting questions regarding some of my earlier discarded analyses. When I returned to the data to try and answer those questions, I found I simply could not regenerate the results. Despite the fact that both the code and each “experiment” were tracked by a git repository and I’d written my programs to output (what I thought to be) reasonable logs, I still could not reproduce my science. It could have been anything: an ad-hoc, temporary tweak to a harness script, a bug fix in the code itself masking a result, or any number of other possible untracked changes to the inputs or program parameters. In general, it was clear that I had failed to collect all pertinent metadata for an experiment.

Whilst it perhaps sounds like I was guilty of negligent book-keeping, it really wasn’t for lack of trying. Yet when dealing with many interesting questions at once, it’s so easy to make ad-hoc changes, or perform undocumented command line based munging of input data, or accidentally run a new experiment that clobbers something. Occasionally, one just forgets to make a note of something, or assumes a change is temporary but for one reason or another, the change becomes permanent without explanation. These subtle pipeline alterations are easily made all the time, and can silently invalidate swathes of results generated before (and/or after) them.

Ultimately, for the purpose of reproducibility, almost everything (copies of inputs, outputs, logs, configurations) was dumped and tar‘d for each experiment. But this approach brought problems of its own: just tabulating results was difficult in its own right. In the end, I was pleased with that dissertation, but a small part of me still hurts when I think back to the problem of archiving and analysing those result sets.

It was a nightmare, and I promised it would never happen again.

Except it has.

A relapse of disorganisation

Two years later and I’ve continued to be capable of convincing a committee to allow me to progress towards adding the title of doctor to my bank account. As part of this quest, recently I was inspecting the results of a harness script responsible for generating trivial haplotypes, corresponding reads and attempting to recover them using Gretel. “Very interesting, but what will happen if I change the simulated read size”, I pondered, shortly before making an ad-hoc change to the harness script and inadvertently destroying the integrity of the results I had just finished inspecting by clobbering the input alignment file used as a parameter to Gretel.

Argh, not again.

Why is this hard?

Consider Gretel: she’s not just a simple standalone tool that one can execute to rescue haplotypes from the metagenome. One must go through the motions of pushing their raw reads through some form of pipeline (pictured below) to generate an alignment (to essentially give a co-ordinate system to those reads) and discover the variants (the positions in that co-ordinate system that relate to polymorphisms on reads) that form the required inputs for the recovery algorithm first.

This is problematic for one who wishes to be aware of the provenance of all outputs of Gretel, as those outputs depend not only on the immediate inputs (the alignment and called variants), but the entirety of the pipeline that produced them. Thus we must capture as much information as possible regarding all of the steps that occur from the moment the raw reads hit the disk, up to Gretel finishing with extracted haplotypes.

But as I described in my last status report, these tools are themselves non-trivial. bowtie2 has more switches than an average spaceship, and its output depends on its complex set of parameters and inputs (that also have dependencies on previous commands), too.

[Figure: the raw-reads-to-Gretel pipeline]

bash scripts are all well and good for keeping track of a series of commands that yield the result of an experiment, and one can create a nice new directory in which to place such a result at the end – along with any log files and a copy of the harness script itself for good measure. But what happens when future experiments use different pipeline components, with different parameters, or we alter the generation of log files to make way for other metadata? What’s a good directory naming strategy for archiving results anyway? What if parts (or even all of the) analysis are ad-hoc and we are left to reconstruct the history? How many times have you made a manual edit to a malformed file, or had to look up exactly what combination of sed, awk and grep munging you did that one time?

One would have expected me to have learned my lesson by now, but I think meticulous digital lab book-keeping is just not that easy.

What does organisation even mean anyway?

I think the problem is perhaps exacerbated by conflating the meaning of “organisation”. There are a few somewhat different, but ultimately overlapping problems here:

  • How to keep track of how files are created
    What command created file foo? What were the parameters? When was it executed, by whom?
  • Be aware of the role that each file plays in your pipeline
    What commands go on to use file foo? Is it still needed?
  • Assure the ongoing integrity of past and future results
    Does this alignment have reads? Is that FASTA index up to date?
    Are we about to clobber shared inputs (large BAMS, references) that results depend on?
  • Archiving results in a sensible fashion for future recall and comparison
    How can we make it easy to find and analyse results in future?

Indeed, my previous attempts at organisation address some but not all of these points, which is likely the source of my bad feeling. Keeping hold of bash scripts can help me determine how files are created, and the role those files go on to play in the pipeline; but results are merely dumped in a directory. Such directories are created with good intent, and named something that was likely useful and meaningful at the time. Unfortunately, I find that these directories become less and less useful as archive labels as time goes on… For example, what the fuck is ../5-virus-mix/2016-10-11__ref896__reg2084-5083__sd100/1?

This approach also had no way to assure the current and future integrity of my results. Last month I had an issue with Gretel outputting bizarrely formatted haplotype FASTAs. After chasing my tail trying to find a bug in my FASTA I/O handling, I discovered this was actually caused by an out of date FASTA index (.fai) on the master reference. At some point I’d exchanged one FASTA for another, assuming that the index would be regenerated automatically. It wasn’t. Thus the integrity of experiments using that combination of FASTA+index was damaged. Additionally, the integrity of the results generated using the old FASTA were now also damaged: I’d clobbered the old master input.

There is a clear need to keep better metadata for files, executed commands and results, beyond just tracking everything with git. We need a better way to document the changes a command makes in the file system, and a mechanism to better assure integrity. Finally we need a method to archive experimental results in a more friendly way than a time-sensitive graveyard of timestamps, acronyms and abbreviations.

So I’ve taken it upon myself to get distracted from my PhD to embark on a new adventure to save myself from ruining my PhD2, and fix bioinformatics for everyone.

Approaches for automated command collection

Taking the number of post-its attached to my computer and my sporadically used notebooks as evidence enough to outright skip over the suggestion of a paper based solution to these problems, I see two schools of thought for capturing commands and metadata computationally:

  • Intrusive, but data is structured with perfect recall
    A method whereby users must execute commands via some sort of wrapper. All commands must have some form of template that describes inputs, parameters and outputs. The wrapper then “fills in” the options and dispatches the command on the user’s behalf. All captured metadata has uniform structure and nicely avoids the need to attempt to parse user input. Command reconstruction is perfect but usage is arguably clunky.
  • Unobtrusive, best-effort data collection
    A daemon-like tool that attempts to collect executed commands from the user’s shell and monitor directories for file activity. Parsing command parameters and inputs is done in a naive best-effort scenario. The context of parsed commands and parameters is unknown; we don’t know what a particular command does, and cannot immediately discern between inputs, outputs, flags and arguments. But, despite the lack of structured data, the user does not notice our presence.

There is a trade-off between usability and data quality here. If we sit between a user and all of their commands, offering a uniform interface to execute any piece of software, we can obtain perfectly structured information and are explicitly aware of parameter selections and the paths of all inputs and desired outputs. We know exactly where to monitor for file system changes, and can offer user interfaces that not only merely enumerate command executions, but offer searching and filtering capabilities based on captured parameters: “Show me assemblies that used a k-mer size of 31”.

But we must ask ourselves, how much is that fine-grained data worth to us? Is exchanging our ability to execute commands ourselves worth the perfectly structured data we can get via the wrapper? How much of those parameters are actually useful? Will I ever need to find all my bowtie2 alignments that used 16 threads? There are other concerns here too: templates that define a job specification must be maintained. Someone must be responsible for adding new (or removing old) parameters to these templates when tools are updated. What if somebody happens to misconfigure such a template? More advanced users may be frustrated at being unable to merely execute their job on the command line. Less advanced users could be upset that they can’t just copy and paste commands from the manual or biostars. What about smaller jobs? Must one really define a command template to run trivial tools like awk, sed, tail, or samtools sort through the wrapper?

It turns out I know the answer to this already: the trade-off is not worth it.

Intrusive wrappers don’t work: a sidenote on sunblock

Without wanting to bloat this post unnecessarily, I want to briefly discuss a tool I’ve written previously, but first I must set the scene3.

Within weeks of starting my PhD, I made a computational enemy in the form of Sun Grid Engine: the scheduler software responsible for queuing, dispatching, executing and reporting on jobs submitted to the institute’s cluster. I rapidly became frustrated with having an unorganised collection of job scripts, with ad-hoc edits that meant I could no longer re-run a job previously executed with the same submission script (does this problem sound familiar?). In particular, I was upset with the state of the tools provided by SGE for reporting on the status of jobs.

To cheer myself up, I authored a tool called sunblock, with the goal of never having to look at any component of Sun Grid Engine directly ever again. I was successful in my endeavour and to this day continue to use the tool on the occasion where I need to use the cluster.

[Screenshot: sunblock]

However, as hypothesised above, sunblock does indeed require an explicit description of an interface for any job that one would wish to submit to the cluster, and it does prevent users from just pasting commands into their terminal. This all-encompassing wrapping feature, which allows us to capture the best structured information on every job, is also the tool’s complete downfall. The useful information that could be extracted using sunblock (there is even a shiny sunblock web interface), its ability to automatically re-run jobs, and its superior reporting on job progress compared to SGE alone, were still not enough to get user traction in our institute.

For the same reason that I think more in-the-know bioinformaticians don’t want to use Galaxy, sunblock failed: because it gets in the way.

Introducing chitin: an awful shell for awful bioinformaticians

Taking what I learned from my experimentation with sunblock on-board, I elected to take the less intrusive, best-effort route to collecting user commands and file system changes. Thus I introduce chitin: a Python-based tool that (somewhat) unobtrusively wraps your system shell, to keep track of commands and file manipulations and address the problem of not knowing how any of the files in your ridiculously complicated bioinformatics pipeline came to be.

I initially began the project with a view to create a digital lab book manager. I envisaged offering a command line tool with several subcommands, one of which could take a command for execution. However as soon as I tried out my prototype and found myself prepending the majority of my commands with lab execute, I wondered whether I could do better. What if I just wrapped the system shell and captured all entered commands? This might seem a rather dumb and roundabout way of getting one’s command history, but consider this: if we wrap the system shell as a means to capture all the input, we are also in a position to capture the output for clever things, too. Imagine a shell that could parse the stdout for useful metadata to tag files with…

I liked what I was imagining, and so despite my best efforts to get even just one person to convince me otherwise; I wrote my own pseudo-shell.

chitin is already able to track executed commands that yield changes to the file system. For each file in the chitin tree, there is a full modification history. Better yet, you can ask what series of commands need to be executed in order to recreate a particular file in your workflow. It’s also possible to tag files with potentially useful metadata, and so chitin takes advantage of this by adding the runtime4 and current user to all executed commands for you.

Additionally, I’ve tried to find my own middle ground between the sunblock-esque configurations that yielded superior metadata, and not getting in the way of our users too much. So one may optionally specify handlers that can be applied to detected commands, and captured stdout/stderr. For example, thanks to my bowtie2 configuration, chitin tags my out.sam files with the overall alignment rate (and a few targeted parameters of interest), automatically.

[Screenshot: chitin tagging a bowtie2 SAM with its overall alignment rate]
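To give a flavour of what such a handler boils down to (a simplified sketch of my own, not chitin’s actual handler interface; see the repository for the real thing): bowtie2 prints its summary, including the overall alignment rate, to stderr, so the handler only needs to scrape it out and hand back some tags:

# Sketch of a bowtie2 stderr handler; chitin's real handler API differs.
import re

ALIGNMENT_RATE_RE = re.compile(r"([\d.]+)% overall alignment rate")

def bowtie2_handler(captured_stderr):
    """Return metadata tags parsed from a bowtie2 run's stderr."""
    match = ALIGNMENT_RATE_RE.search(captured_stderr)
    if match:
        return {"overall_alignment_rate": float(match.group(1))}
    return {}

# e.g. bowtie2_handler("...\n91.53% overall alignment rate\n")
#      -> {'overall_alignment_rate': 91.53}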

chitin also allows you to specify handlers for particular file formats to be applied to files as they are encountered. My environment, for example, is set-up to count the number of reads inside a BAM, and associate that metadata with that version of the file:

[Screenshot: chitin associating a read count with a BAM]

In this vein, we are in a nice position to check on the status of files before and after a command is executed. To address some of my integrity woes, chitin allows you to define integrity handlers for particular file formats too. Thus my environment warns me if a BAM has 0 reads, is missing an index, or has an index older than itself. Similarly, an empty VCF raises a warning, as does an out of date FASTA index. Coming shortly will be additional checks for whether you are about to clobber a file that is depended on by other files in your workflow. Kinda cool, even if I do say so myself.
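As a sketch of the sort of thing an integrity handler does (again, an illustration of my own rather than chitin’s real interface; the read count leans on pysam):

# Sketch of a BAM integrity check in the spirit of the handlers described
# above. Requires pysam for the read count.
import os
import pysam

def check_bam_integrity(bam_path):
    """Return a list of human-readable warnings about a BAM and its index."""
    warnings = []
    if pysam.AlignmentFile(bam_path, "rb").count(until_eof=True) == 0:
        warnings.append("%s has 0 reads" % bam_path)
    index_path = bam_path + ".bai"
    if not os.path.exists(index_path):
        warnings.append("%s is missing an index" % bam_path)
    elif os.path.getmtime(index_path) < os.path.getmtime(bam_path):
        warnings.append("%s has an index older than itself" % bam_path)
    return warnings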

Conclusion

Perhaps I’m trying to solve a problem of my own creation. Yet from a few conversations I’ve had with folks in my lab, and frankly, anyone I could get to listen to me for five minutes about managing bioinformatics pipelines, there seems to be sympathy to my cause. I’m not entirely convinced myself that a “shell” is the correct solution here, but it does seem to place us in the best position to get commands entered by the user, with the added bonus of getting stdout to parse for free. Though, judging by the flurry of Twitter activity on my dramatically posted chitin screenshots lately, I suspect I am not so alone in my disorganisation and there are at least a handful of bioinformaticians out there who think a shell isn’t the most terrible solution to this either. Perhaps I just need to be more of a wet-lab biologist.

Either way, I genuinely think there’s a lot of room to do cool stuff here, and to my surprise, I’m genuinely finding chitin quite useful already. If you’d like to try it out, the source for chitin is open and free on GitHub. Please don’t expect too much in the way of stability, though.


tl;dr

  • A definition of “being organised” for science and experimentation is hard to pin down
  • But independent of such a definition, I am terminally disorganised
  • Seriously what the fuck is ../5-virus-mix/2016-10-11__ref896__reg2084-5083__sd1001
  • I think command wrappers and platforms like Galaxy get in the way of things too much
  • I wrote a “shell” to try and compensate for this
  • Now I have a shell, it is called chitin

  1. This is a genuine directory in my file system, created about a month ago. It contains results for a run of Gretel against the pol gene on the HIV genome (2084-5083). Off the top of my head, I cannot recall what sd100 is, or why reg appears before the base positions. I honestly tried. 
  2. Because more things that are not my actual PhD is just what my PhD needs. 
  3. If it helps you, imagine some soft jazz playing to the sound of rain while I talk about this gruffly in the dark with a cigarette poking out of my mouth. Oh, and everything is in black and white. It’s bioinformatique noir
  4. I’m quite pleased with this one, because I pretty much always forget to time how long my assemblies and alignments take. 