alignment – Samposium: The Exciting Adventures of Sam
https://samnicholls.net

bowtie2: Relaxed Parameters for Generous Alignments to Metagenomes
https://samnicholls.net/2016/12/24/bowtie2-metagenomes/
Sat, 24 Dec 2016 00:34:46 +0000

In a change to my usual essay-length posts, I wanted to share a quick bowtie2 tip for relaxing the parameters of alignment. It’s no big secret that bowtie2 has these options, and there’s some pretty good guidance in the manual, too. However, we’ve had significant trouble in our lab finding a suitable set of permissive alignment parameters.

In the course of my PhD work on haplotyping regions of metagenomes, I have found that even when using bowtie2’s somewhat permissive --very-sensitive-local preset, sequences with an identity to the reference of less than 90% are significantly less likely to align back to that reference. This is problematic in my line of work, where I wish to recover all of the individual variants of a gene, as my approach relies on a set of short reads (50-250bp) aligned to a position on a metagenomic assembly (that I term the pseudo-reference). It’s important to note that I am not interested in the assembly of individual genomes from metagenomic reads, but the genes themselves.

Recently, the opportunity arose to provide some evidence for this. I have some datasets which constitute “synthetic metahaplomes” that consist of a handful of arbitrary known genes that all perform the same function, each from a different organism. These genes can be broken up into synthetic reads and aligned to some common reference (another gene in the same family).

This alignment can be used as a means to test my metagenomic haplotyper, Gretel (and her novel brother data structure, Hansel), by attempting to recover the original input sequences from these synthetic reads. I’ve already reported in my pre-print that our method is at the mercy of the preceding alignment, and used this as the hypothesis for a poor recovery in one of our data sets.

Indeed as part of my latest experiments, I have generated some coverage heat maps, showing the average coverage of each haplotype (Y-axis) at each position of the pseudo-reference (X-axis) and I’ve found that for sequences beyond the vicinity of 90% sequence identity, --very-sensitive-local becomes unsuitable.
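To make concrete what each cell of such a heat map holds, here is a minimal sketch that computes per-position read depth for each haplotype and then averages it over fixed-width windows. The input format (tuples of haplotype name, 0-based start, exclusive end), the function names and the toy reads are all assumptions for illustration; this is not necessarily how the original heat maps were produced.

```python
from collections import defaultdict


def coverage_by_haplotype(alignments, ref_length):
    """Return {haplotype: [depth at each reference position]}.

    `alignments` is an iterable of (haplotype, start, end) tuples, where
    start is 0-based and end is exclusive (an assumed, simplified format).
    """
    depth = defaultdict(lambda: [0] * ref_length)
    for haplotype, start, end in alignments:
        # Clamp to the reference so stray coordinates cannot index out of range.
        for pos in range(max(start, 0), min(end, ref_length)):
            depth[haplotype][pos] += 1
    return dict(depth)


def windowed_mean(depths, window):
    """Average a per-position depth list over consecutive windows."""
    means = []
    for i in range(0, len(depths), window):
        chunk = depths[i:i + window]
        means.append(sum(chunk) / len(chunk))
    return means


# Toy example: two haplotypes against a 10 bp pseudo-reference.
reads = [("hapA", 0, 5), ("hapA", 3, 8), ("hapB", 6, 10)]
cov = coverage_by_haplotype(reads, 10)
heatmap_rows = {hap: windowed_mean(d, 5) for hap, d in cov.items()}
print(heatmap_rows)
```

Each row of `heatmap_rows` corresponds to one haplotype (Y-axis) and each value to a window along the pseudo-reference (X-axis); a haplotype whose reads are being dropped shows up as a row of low values.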

The BLAST record below represents the alignment that corresponds to the gene whose reads go on to align at the average coverage depicted at the top bar of the above heatmap. Despite its 79% identity, it looks good(TM) to me, and I need sequences with this level of diversity to align to my pseudo-reference so they can be included in Gretel’s analysis. I need generous alignment parameters to permit even quite diverse reads (but hopefully not so diverse that they are no longer from a gene of the same family) to map back to my reference. Otherwise Gretel will simply miss these haplotypes.

So despite having already spent many days of my PhD repeatedly failing to increase my overall alignment rates for my metagenomes, I felt this time it would be different. I had a method (my heatmap) to see how my alignment parameters affected the alignment rates of reads on a per-haplotype basis. It’s also taken until now for me to quantify just what sort of sequences we are missing out on, courtesy of dropped reads.

I was determined to get this right.

For a change, I’ll save you the anticipation and tell you what I settled on after about 36 hours of getting cross.

  • --local -D 20 -R 3
    Ensure we’re not performing end-to-end alignment (allow for soft clipping and the like), and borrow the most sensitive default “effort” parameters.
  • -L 3
    The seed substring length. Decreasing this from the default (20 - 25) to just 3 allows for a much more aggressive alignment, but adds computational cost. I actually had reasonably good results with -L 11, which might suit you if you have a much larger data set but still need to relax the aligner.
  • -N 1
    Permit a mismatch in the seed, because why not?
  • --gbar 1
    Has a small, but noticeable effect. Appears to thin the width of some of the coverage gap in the heatmap at the most stubborn sites.
  • --mp 4
    Reduces the maximum penalty that can be applied to a strongly supported (high quality) mismatch by a third (from the default value of 6). The aggregate sum of these penalties is responsible for the dropping of reads. Along with the substring length, this had a significant influence on increasing my alignment rates. If your coverage stains are stubborn, you could decrease this again.

Tada.
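To see roughly why lowering --mp lets more diverse reads through, the sketch below works through the scoring arithmetic as I understand it from the bowtie2 manual: a per-mismatch penalty of MN + floor((MX - MN) * min(Q, 40) / 40), a match bonus of 2 in local mode, and a default local-mode minimum score of 20 + 8 * ln(read length). It assumes the remaining scoring options are left at their defaults and, as a deliberate simplification, that the read aligns along its full length with no soft clipping, gaps or Ns. The function names are hypothetical.

```python
import math


def mismatch_penalty(quality, mx=6, mn=2):
    """Penalty for a single mismatch at Phred quality `quality`."""
    return mn + math.floor((mx - mn) * (min(quality, 40.0) / 40.0))


def min_local_score(read_length):
    """Assumed default local-mode threshold: 20 + 8 * ln(read length)."""
    return 20 + 8 * math.log(read_length)


def max_mismatches(read_length, mx, match_bonus=2):
    """Most high-quality mismatches an ungapped, unclipped read can carry.

    Simplifying assumptions: every base is aligned (no soft clipping), there
    are no gaps or Ns, and every mismatch is high quality so it costs `mx`.
    """
    threshold = min_local_score(read_length)
    allowed = 0
    for m in range(read_length + 1):
        score = match_bonus * (read_length - m) - mx * m
        if score < threshold:
            break
        allowed = m
    return allowed


for mx in (6, 4):
    print(f"--mp {mx}: up to {max_mismatches(100, mx)} mismatches in a 100 bp read")
```

Under these assumptions a 100 bp read tolerates 17 high-quality mismatches at the default penalty and 23 with --mp 4. Real local alignments can soft clip their ends, so treat this as a rough guide to the direction of the effect rather than a prediction of alignment rates.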


tl;dr

  • bowtie2 --local -D 20 -R 3 -L 3 -N 1 -p 8 --gbar 1 --mp 4
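The one-liner above leaves out the index, reads and output arguments. As a hedged sketch of how the full invocation might be assembled from a script, the helper below builds the argument list using bowtie2's -x (index), -U (unpaired reads) and -S (SAM output) options. The helper name and all file names are hypothetical placeholders, and the mismatch penalty is passed in explicitly so you can try other values.

```python
import shlex


def build_bowtie2_command(index_prefix, reads_path, sam_path, mismatch_penalty, threads=8):
    """Assemble a bowtie2 argument list using the relaxed settings from this post."""
    return [
        "bowtie2",
        "--local", "-D", "20", "-R", "3",  # local mode with generous effort settings
        "-L", "3", "-N", "1",              # short seeds, one mismatch allowed per seed
        "--gbar", "1",
        "--mp", str(mismatch_penalty),
        "-p", str(threads),
        "-x", index_prefix,                # index built beforehand with bowtie2-build
        "-U", reads_path,                  # unpaired reads (placeholder file name)
        "-S", sam_path,                    # SAM output (placeholder file name)
    ]


# 4 follows the --mp bullet above; lower it if coverage gaps remain stubborn.
cmd = build_bowtie2_command("pseudo_reference", "reads.fastq", "aligned.sam", mismatch_penalty=4)
print(shlex.join(cmd))

# To actually run it (requires bowtie2 on your PATH):
# import subprocess
# subprocess.run(cmd, check=True)
```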
Teaching children how to be a sequence aligner with Lego at Science Week
https://samnicholls.net/2016/03/29/abersciweek16/
Tue, 29 Mar 2016 22:59:46 +0000

As part of a PhD it is anticipated [1] that you will share your science with various audiences: fellow PhD students, peers in the field and the various publics. Every year, the university celebrates British Science Week with a Science Fair, inviting possibly the most difficult public to engage with: children. Over three days the fair serves to educate and entertain 1700 pupils from over 30 schools based across Mid Wales, and this year I volunteered [2] to run a stand.

How to explain assembly?

I was inspired by Amanda’s activity for prospective students at a visiting day a few weeks prior. To describe the problem of DNA sequence assembly and alignment in a friendly (and quick) way, Amanda had hundreds of small pieces of paper representing DNA reads. The read set was generated with Titus Brown’s shotgunator tool, slicing a few sentences about the problem (meta!) into k-mers, with a few errors and omissions for good measure. Visitors were asked to help us assemble the original sequence (the sentences) by exploiting the overlaps between reads.

I like this activity as it gives a reasonable intuition for how assembly of genomes works, using just scraps of paper. Key is that the DNA is abstracted into something more tangible to newcomers – English words building sentences – which is far simpler to explain and understand, especially in a short time. It’s also quite easy to describe some of the more complicated issues of assembly, namely errors and repeats via misspellings and repeated words or phrases.
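For the curious, the paper game can be mimicked in a few lines. This is a minimal sketch of greedy overlap assembly on English text, assuming error-free, evenly spaced reads; it is not how shotgunator generates reads, and real assemblers are far more sophisticated. The sentence and function names are made up for the example.

```python
def overlap(left, right, min_length=3):
    """Length of the longest suffix of `left` that is a prefix of `right`."""
    for size in range(min(len(left), len(right)), min_length - 1, -1):
        if left.endswith(right[:size]):
            return size
    return 0


def greedy_assemble(reads, min_length=3):
    """Repeatedly merge the pair of reads with the largest overlap."""
    reads = list(dict.fromkeys(reads))  # drop exact duplicate reads, keep order
    while len(reads) > 1:
        best_size, best_pair = 0, None
        for i, a in enumerate(reads):
            for j, b in enumerate(reads):
                if i != j:
                    size = overlap(a, b, min_length)
                    if size > best_size:
                        best_size, best_pair = size, (i, j)
        if best_pair is None:
            break  # nothing overlaps any more; leave the pieces as separate contigs
        i, j = best_pair
        merged = reads[i] + reads[j][best_size:]
        reads = [r for k, r in enumerate(reads) if k not in (i, j)] + [merged]
    return reads


# Error-free "reads" of 12 characters, taken every 4 characters along a sentence.
sentence = "assembly is like a jigsaw without the picture on the box"
reads = [sentence[i:i + 12] for i in range(0, len(sentence) - 11, 4)]
print(greedy_assemble(reads))
```

On this tidy input the greedy merge recovers the sentence. Add misspellings or a repeated phrase longer than the overlaps and it can stall or join the wrong pieces, which is exactly the difficulty the activity is meant to convey.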

A problem with pigeonholing college students?

Yet to my surprise, the majority of the compscis-to-be were quite apprehensive of taking on the task at the mere mention of this being a biological problem, despite the fact that sequence alignment can be easily framed as a text manipulation problem. Their apprehension only increased when introduced to Amanda’s genome game; a fun web-based game that generates a small population with a short binary genome whose rules must be guessed before the time runs out. A few puzzled visitors offered various flavours of “…but I’m not here to do biology!”, and one participant backed out of playing with “…but biology is scary and too hard!”. In general the activities had a reasonable reception but visitors appeared more interested in the Arduinos, web games and robots – their comfort zone, presumably.

One need not necessarily be an expert in biology (I’m certainly not) to be able to contribute to the study of computationally framed questions in that field. As mentioned, DNA alignment is effectively string manipulation and those strings could be anything! Indeed this is even demonstrated by our activity using English sentences rather than the alphabet ACGT.

From experience, undergraduates (and apparently college students) appear keen to pigeonhole themselves early (“…dammit Jim I’m a computer scientist not a bioinformatician”) via their prior beliefs about the meaning of “computing”, and their module/A-level choices. I think it is at this stage where subjects outside one’s choices become “scary” and fall outside one’s scope of interest – “…if I wanted to learn biology why would I be doing compsci?”. Yet most jobs from finance to game development will require some domain specific knowledge and reading outside computing, whether it’s economics, physics or even art and soundscape design.

This is why it is important as a computer science department that we introduce undergraduates to other potential applications of the field. It’s not that we should push students to study bioinformatics over robotics, but that many students can easily go on unaware that computing can be widely applicable to research endeavours in different fields in the first place. To combat the “this is not my area” issue, many assignments in our department have a real-world element, often just tidbits of domain specific knowledge that force students to recognise the need for a base understanding of something outside of their comfort zone.

Lego: a unicorn-like universal engagement tool

College students aside, I needed to work out how to engage schoolchildren between the ages of 10-12 with this activity. Scraps of paper would be unlikely to hold the attention of my target age group for long. I needed something more tangible and less fiddly than strips of paper. It was while describing the problem of introducing these “building blocks of nature” to kids in a simple way when the perfect metaphor popped into mind: Lego.

Yes! A 2×2 brick can represent an individual nucleotide, and we can use different coloured bricks to colour code the four nucleotides (and maybe another for “missing” if we’re feeling mean). A small stack of bricks builds a short string of DNA to represent a read. The colour code effectively abstracts away the potentially-confusing ACGT alphabet, making the alignment game easier to play (matching just colours, rather than symbols that need parsing first) and also quite aesthetically pleasing.

The hard part was sourcing enough Lego. I returned to my parents’ home to dig through my childhood and retrieve years’ worth of collected pieces, but once back in Aberystwyth I was surprised to find that after sorting through two whole boxes I did not own more than some 100 2×2 bricks (and most were not in colours I wanted). Bricks, it appears, are actually quite hard to come by! I put out a request for help on the Aber Comp Sci Facebook group and a lecturer kindly performed the same sort with his children’s collections. Their collection must have been more substantial and yielded 150-200 bricks in a mix of four colours, saving my stand.

The setup

The activity itself is simple and needs nothing other than some patter, the Lego and a surface for kids to align the pieces on. I spent more time than I would like to admit covering a cardboard box with tinfoil to create the SAMTECH SEQUENCER 9000 (described by Illumina as “shiny”), a prop to contextualise the problem: we can’t look at whole genomes, only short pieces of it that need assembly.

Of course, we’d need some read sets. To make these, I divided the available bricks into two piles. Nathan and I then each ad-libbed sliding k-mers of length 5 (i.e. each stack would have other stacks overlapping it by 4, 3, 2 and 1 coloured bricks – which each had their own overlaps…) to build up an arbitrary genome to recover. Simple!
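We built our read sets by hand, but the same idea is easy to sketch in code: pick a genome, slide a window of length 5 along it one base at a time, and translate each read into brick colours. The colour assignment, the 12-base genome and the function names here are assumptions for illustration only.

```python
import random

# Assumed colour code; any four distinct colours would do.
COLOURS = {"A": "red", "C": "blue", "G": "yellow", "T": "green"}


def sliding_kmers(genome, k=5, step=1):
    """Cut `genome` into every window of length k, advancing by `step`."""
    return [genome[i:i + k] for i in range(0, len(genome) - k + 1, step)]


def to_lego_stack(read):
    """Translate a read into the list of brick colours that would build it."""
    return [COLOURS[base] for base in read]


rng = random.Random(2016)  # fixed seed so the toy genome is reproducible
genome = "".join(rng.choice("ACGT") for _ in range(12))
reads = sliding_kmers(genome, k=5)
rng.shuffle(reads)  # reads come out of the sequencer in no particular order

for read in reads:
    print(read, to_lego_stack(read))
```

With a step of 1, neighbouring reads share 4 bricks, reads two apart share 3, and so on down to 1, which is the overlap structure described above. A 12-base genome yields 8 reads of 5 bricks each, so 40 bricks per read set.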

Running the activity

Once doors opened, there was no shortage of children wanting to try out the stand. I think the mystery of the tinfoil box and the allure of playing with Lego was enough to grab attention, though Nathan (my lovely assistant) and I would flag down passers-by if the table was free. Pupils were encouraged to visit as many activities as possible by means of a questionnaire, on which each stand posed a scientific question that could be answered by completing that particular stand’s activity. Unfortunately for us, our stand’s question was not included on the questionnaire (I guess we submitted it too late) but luckily, we found pupils were keen to write down and find an answer to our “bonus question” after all.

We quickly developed a double-act routine; opening by quizzing our aligners on what they knew about DNA, which was typically not much, though it was nice to hear that the majority were aware that “it’s inside us”. Interestingly, of the pupils who responded in the positive to being asked what DNA was, their exposure was primarily from television – specifically when used for identification of criminals. Nathan would then explain that if we wanted to look at somebody’s DNA, we would take a sample from them and process it with the shiny tinfoil sequencer. This special machine would apply some magic science and produce short DNA reads that had to be pieced back together to recover the whole genome.

At this point we’d invite participants to open the lid of the sequencer and take out a batch of reads (of a possible two sets) for assembly. We’d explain the rules and show some examples of a correct alignment: sequences of matching runs of colour between two or more Lego stacks. Once they got the hang of it, we’d leave them to it for a little while. The two sets meant that we could split larger groups into pairs or triplets to ensure that everybody had a chance to make some successful alignments.

As the teams came to finishing alignment of the most obvious motifs (Nathan and I both accidentally made a few triplets of colours that resembled well known flags in our read sets – which was handy), progress would begin to slow and a few more difficult or red-herring reads would be left over, and Nathan or I would start narrating the problem, asking teams if this had been more difficult than expected. I don’t think any team agreed that the activity had been easy! We used this as an opportunity to interrupt the game to frame how complicated assembly is for real sequences and reveal the answer to our question.

The debrief

This was my favourite part: I’d hold up one of the Lego stacks and pull it apart. “Each of these bricks is a single base; stacked together they make this read, which tells us what a small part of a much longer genome looks like”. I’d then ask how long they imagined a whole human genome might be. Answers most frequently ranged between 100 and 1000, while a minority guessed between 4 and 15. No pupil ventured guesses beyond a million. For the very small guesses, I’d assemble a Lego stack of that length and ask if they still thought the differences between us all could be explained by such a short genome – nobody changed their mind [3].

The look on their faces when I revealed it was actually three billion made the entire activity worth it. If we had enough Lego to build a genome, it would be 28,800km tall and stretch into space far beyond where global positioning satellites are in orbit. I’d explain that when we do this for real, the stacks aren’t five bases long, but more like a hundred, and instead of the handful of reads we had in our tinfoil sequencer, there were millions of reads to align and assemble. They’d gasp and look around at each other’s faces, equally stunned. We even had some teachers dumbfounded by this reveal. “This is why computers are now so important in biology, this would be impossible otherwise!”. We’d clear up any last questions or confusions and thank them for playing.
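The tower height is simple arithmetic, sketched here for anyone who wants to check it. It assumes a standard brick height of 9.6 mm (excluding the stud) and a GPS orbit altitude of roughly 20,200 km; neither figure is stated above, so treat both as assumptions.

```python
BRICK_HEIGHT_MM = 9.6            # assumed height of one brick, excluding the stud
GENOME_LENGTH_BASES = 3_000_000_000
GPS_ORBIT_ALTITUDE_KM = 20_200   # assumed approximate altitude of GPS satellites

MM_PER_KM = 1_000_000

# One brick per base, stacked end to end.
tower_km = GENOME_LENGTH_BASES * BRICK_HEIGHT_MM / MM_PER_KM

print(f"Genome tower: {tower_km:,.0f} km")
print(f"Times the GPS orbit altitude: {tower_km / GPS_ORBIT_ALTITUDE_KM:.2f}")
```

Three billion bricks at 9.6 mm each comes to 28,800 km, comfortably higher than the GPS constellation.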

Some observations

I would not consider our first group a rallying success. I was not ready for how difficult assembly of a set of unique 5-mers would be. The group had significant trouble recovering the genome and as it turned out, Nathan and I did too. The situation had not been helped by the fact that the group had also taken a mix of reads from both batches in the tinfoil sequencer. As it turns out, even trivial assembly is really hard. I could tell the kids were somewhat disappointed and the difficulty of the game had hampered their enjoyment. We recovered by wowing them with facts about the human genome and they asked some good questions too. Once they left the table, Nathan began the patter with the next group as I hurriedly worked to reduce the number of red-herring reads and recycle the bricks to create duplicate reads which allowed groups to make progress more quickly at the beginning (and effectively turned difficulty into a ramp, rather than uniformly hard to play). This improved further games considerably.

I was surprised how happy the pupils were to append our fairly long question to an already quite lengthy questionnaire, and how keen they were to find the answer, too. Not a single pupil was put off from our activity at the mention of biology, DNA or even unfamiliar terminology like “sequencer”, or “read”. Fascinatingly, Amanda also ran the aforementioned genome game and it was a hit. I guess primary school students are just open to a very wide definition of science and are yet to pigeonhole themselves? Activities like this at an early age have the potential to massively influence how our next generation of scientists see science: as a large collaborative effort in which skills can be transferred and shared to solve important and interesting questions. The pupils simply had no idea that computers could be used like this, for science, let alone biologically inspired questions.

In general the activity went down very well: the kids seemed to get the concept very quickly and also understood the (albeit naive) parallel to DNA. I think they genuinely learned a thing or two (the human genome is big!) and enjoyed themselves. I’m pleased that we managed to draw and keep attention to our stand, given we were wedged between a bunch of old Atari consoles and a display of unmanned aerial vehicles.

I was definitely surprised at how much I enjoyed running the stand too. I’m not overly fond of children and was expecting to have to put on a brave face to deal with tiny disinterested people in assorted bright sweaters all day. Yet all but one or two pupils were happy to be there, incredibly enthusiastic to learn, asked great questions (sometimes incredibly insightful questions) and genuinely had a nice time and thanked us for it. Enjoyment aside, I took the second day off as I’d also found running the activity over and over oddly draining.

Future activities

If I were to run this again, I’d like to make it a little more interactive and ideally give players a chance to actually use Lego for its intended purpose: building something. Thankfully at our stand, students were not particularly disappointed when our rules stated that they couldn’t take the reads apart, or put them together (i.e. couldn’t actually play with the Lego…). To improve, my idea would be to get participants to construct a short genome out of Lego pieces that can be truly “sequenced” by pushing it through some sort of colour sensor or camera apparatus attached to an Arduino inside a future iteration of the trusty SAMTECH Sequencer range. Some trivial software would then give the player some sort of monster to name [4], print off and call their own.

To run the activity again in its current form, I think I’d need to have more Lego. However, it turns out that packs of 2×2 bricks in one colour are widely available on eBay and Amazon, though aren’t actually that much cheaper than ordering via the “Pick a Brick” service on the canonical Lego website. I’ve ordered a few packs (at an astonishing £0.12 per brick) as I would like to try and run this activity at other events to spread the sheer joy that bioinformatics can bring to one’s afternoon.

To give the current version of the game a little more of a goal, it would have been ideal to explain the concept of a genomic reference and have the players align the reads to that (as well as each other). In effect this would have been like solving the edges of a jigsaw, giving a sense of quick progress (which means fun) and affording us the opportunity to explain more of the “real science” behind the game. To make the game more difficult, we could have properly employed “missing bases” and the common issues that plague assembly including repeats (which are easier to explain with a reference), as well as errors. After the first group at the Science Fair, I quickly removed the majority of sneaky errors as they made the game too “mean” (where Nathan or I had to explain “No that one doesn’t go there!” too frequently).

Some proof what I did public engagement [5]

tl;dr

  • Actual Lego bricks are hard to come by (unless you just buy them)
  • Typical ten-year-olds are not as dumb or as apathetic towards science as one might expect
  • Assembly is actually pretty hard
  • Engaging children with science is exhausting but surprisingly rewarding
  • Acquire more Lego
  • It’s very hard to tinfoil a cardboard box nicely

  1. Read, required. 
  2. Read, was coerced. 
  3. With a single Lego brick in hand, one kid looked me dead in the eye and said “Yeah!” when asked if this single base could explain the differences between every human on Earth. 
  4. Genome McGenface? 
  5. Absolutely not using this to pass my public engagement module. 