error – Samposium
https://samnicholls.net
The Exciting Adventures of Sam

How to switch SATA controller driver from RAID to AHCI on Windows 10 without a reinstall
https://samnicholls.net/2016/01/14/how-to-switch-sata-raid-to-ahci-windows-10-xps-13/
Thu, 14 Jan 2016 03:50:57 +0000

This advice was posted in 2016 and worked for me. I’m aware that it has worked for others, but for several people (see comments), it wrecked their computer. Before following this advice, you should know what you are doing and be confident and prepared to restore your computer from a good backup. Good luck!
For those who don’t care for the learning adventure, just skip to my guide to switch storage controller driver without wrecking Windows.

This evening, I was bemused to find a Linux live disk unable to identify the storage volume on my new Dell XPS 13 laptop. A quick search introduced me to a problem I had not encountered before: the SSD was likely configured to use a SATA controller mode that did not have a driver in the kernel of the live disk installer. This typically happens when the stock disk has been shipped in either IDE mode (for backwards compatibility purposes) or a vendor-specific RAID mode, instead of the native Advanced Host Controller Interface (AHCI) that exposes some of SATA’s more advanced features.

One can easily change this setting in the BIOS. On my XPS I had to navigate to System Configuration > SATA Configuration and switch the radio button selection from RAID On to AHCI. A rather scary warning informed me that this would more than likely break my existing partitions. As a curious scientist with a recovery partition as a safety net, I decided to proceed anyway. Unsurprisingly, Windows 10 failed to boot, electing to display the dreaded sideways smiley face and a suggestion that I read up about the INACCESSIBLE_BOOT_DEVICE error. Oops.

It turns out that, to optimize boot times, Windows disables drivers that are deemed unnecessary for startup during installation. Herein lies the problem: if the OS is installed while the disk is in one of these other modes (in my case RAID), the driver that would allow us to speak AHCI to our AHCI-speaking SATA storage controller is effectively disabled (even though it is installed). Windows, without the ability to communicate with the disk correctly, has no real option but to fall on its side with a glum expression and throw the INACCESSIBLE_BOOT_DEVICE error during startup. The accusations are corroborated by the Wikipedia article on the subject of AHCI:

Some operating systems, notably Windows Vista, Windows 7, Windows 8 and Windows 10 do not configure themselves to load the AHCI driver upon boot if the SATA-drive controller was not in AHCI mode at the time of installation. This can cause failure to boot, with an error message, if the SATA controller is later switched to AHCI mode.

So what are we to do? If I want to install and run Linux, I need my SSD’s SATA controller to be set to AHCI [1]. Yet if I want to dual-boot with Windows, I need to use RAID to match the currently installed Intel vendor driver. A conundrum!

Official advice from vendors like Intel is that you should format the disk, set the controller mode as desired and then reinstall the Windows operating system. But this seems somewhat of a cop-out: what if lazy people like me don’t have physical installation media to hand, or don’t want to go through the hassle of a format and reinstall? Evidently, I am not the first to ask this question, as there are many threads online that attempt to achieve this for Windows 10 [2], with varying degrees of success garnered from fiddling around in the registry (and variants thereof) to merely booting into safe mode and back. Unfortunately, none of these fixes worked for me and so I worked to come up with my own:

Sam’s super easy guide to switching your SATA Controller from RAID to AHCI without destroying your Windows 10 disk

  • Boot to Windows with your current SATA controller configuration
  • Open Device Manager
  • Expand Storage Controllers and identify the Intel SATA RAID Controller
  • View properties of the identified controller
  • On the Driver tab, click the Update driver… button
  • Browse my computer…, Let me pick…
  • Uncheck Show compatible hardware
  • Select Microsoft as manufacturer
  • Select Microsoft Storage Spaces Controller as model [3]
  • Accept that Windows cannot confirm that this driver is compatible
  • Save changes, reboot to BIOS and change RAID SATA Controller to AHCI
  • Save changes and reboot normally, hopefully to Windows

If you’ve exhausted your luck elsewhere, I hope this works for you as it did for me, but your mileage will almost certainly vary.


  1. Confusingly, according to the AHCI article on Wikipedia:

       Intel recommends choosing RAID mode on their motherboards (which also enables AHCI) rather than AHCI/SATA mode for maximum flexibility.

     If this really is the case, why doesn’t our trusty Linux live disk installer identify the dual-wielding AHCI and RAID disk in question? I wisely chose to stop at the entrance to the rabbit hole on this occasion and was just happy I could move on with my Linux installation.

  2. Other articles exist for Windows 7 and Windows 8 too, but are generally disregarded as unhelpful for Windows 10. In particular, the default AHCI driver provided by Microsoft changed name between versions 7 and 8, so much of the advice pertains to registry keys and files that don’t exist if followed for versions 8 and 10.

  3. A more specific driver from your vendor may be available for your storage controller.

Duplicate definition error with GATK PrintReads and MalformedReadFilter
https://samnicholls.net/2016/01/07/gatk-printreads-malformedreadfilter/
Thu, 07 Jan 2016 19:27:17 +0000

This afternoon I wanted to quickly check [1] whether some reads in a BAM would be filtered out by the GATK MalformedReadFilter. As you can’t invoke the filter alone, I figured one of the quickest ways to do this would be to utilise GATK PrintReads, which pretty much parses and spits out input BAMs, while also allowing one to specify filters and the like to be applied to the parser as it dutifully goes about its job of taking up all your cluster’s memory. I entered the command, taking care to specify MalformedRead for the -rf read filter option, feeling particularly pleased with myself for finally being capable of using a GATK command from memory:

java -jar GenomeAnalysisTK.jar -T PrintReads -rf MalformedRead -I <INPUT> -R <REFERENCE>

GATK, wanting to teach me a lesson for not consulting documentation, quickly dumped a stack trace to my terminal and wiped the smile off my face.

##### ERROR ------------------------------------------------------------------------------------------
##### ERROR stack trace 
org.broadinstitute.gatk.utils.exceptions.ReviewedGATKException: Duplicate definition of argument with full name: filter_reads_with_N_cigar
        at org.broadinstitute.gatk.utils.commandline.ArgumentDefinitions.add(ArgumentDefinitions.java:59)
        at org.broadinstitute.gatk.utils.commandline.ParsingEngine.addArgumentSource(ParsingEngine.java:150)
        at org.broadinstitute.gatk.utils.commandline.CommandLineProgram.start(CommandLineProgram.java:207)
        at org.broadinstitute.gatk.utils.commandline.CommandLineProgram.start(CommandLineProgram.java:155)
        at org.broadinstitute.gatk.engine.CommandLineGATK.main(CommandLineGATK.java:106)
##### ERROR ------------------------------------------------------------------------------------------
##### ERROR A GATK RUNTIME ERROR has occurred (version 3.5-0-g36282e4):
##### ERROR
##### ERROR This might be a bug. Please check the documentation guide to see if this is a known problem.
##### ERROR If not, please post the error message, with stack trace, to the GATK forum.
##### ERROR Visit our website and forum for extensive documentation and answers to 
##### ERROR commonly asked questions http://www.broadinstitute.org/gatk
##### ERROR
##### ERROR MESSAGE: Duplicate definition of argument with full name: filter_reads_with_N_cigar
##### ERROR ------------------------------------------------------------------------------------------

At this point I felt somewhat hopeless. I was actually trying to use the MalformedReadFilter to debug something else, and now I was stuck two errors deep, surrounded by more Java than I could stomach. Before having a full breakdown about whether bioinformatics really is broken, I remembered I am a little familiar with the filter in question. Indeed, I recognised the filter_reads_with_N_cigar argument from the error as one that can be supplied to the MalformedReadFilter itself. This seemed a little odd: where could it be getting a duplicate definition from?

Of course, from my own blog post and the PrintReads manual page, I should have recalled that the MalformedReadFilter is automatically applied by PrintReads. Specifying the same filter on top with -rf apparently causes somewhat of a parsing upset. So there you have it, if you want to check whether your reads will be discarded by the MalformedReadFilter, you can just use PrintReads:

java -jar GenomeAnalysisTK.jar -T PrintReads -I <INPUT> -R <REFERENCE>

tl;dr

  • GATK PrintReads applies the MalformedReadFilter automatically
  • Specifying -rf MalformedRead to PrintReads is not only redundant but problematic
  • Always read the fucking manual
  • Read your own damn blog
  • GATK is unforgiving

  1. It’s about time I realised that in bioinformatics, nobody has ever successfully “quickly checked” anything. 

Grokking GATK: Common Pitfalls with the Genome Analysis Tool Kit (and Picard)
https://samnicholls.net/2015/11/11/grokking-gatk/
Wed, 11 Nov 2015 16:11:50 +0000

The Genome Analysis Tool Kit (“the” GATK) is a big part of our pipeline here. Recently I’ve been following the DNASeq Best Practice Pipeline for my limpet sequence data. Here are some of the mistakes I made and how I made them go away.

Input file extension pedanticism

Invalid command line: The GATK reads argument (-I, --input_file) supports only BAM/CRAM files with the .bam/.cram extension

Starting small, this was a simple oversight on my part: my naming script had made a mistake, but I knew the files were BAM, so I ignored the issue and continued with the pipeline anyway. GATK, however, was not impressed and aborted immediately. A minor annoyance (the error even acknowledges the input appears to be BAM) but a trivial fix.
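
If you would rather confirm that a misnamed file really is a BAM before renaming it, the check below is a minimal sketch. It leans on the fact that BAM is BGZF (gzip-compatible) compressed data whose decompressed contents begin with the four bytes BAM\1. The function name looks_like_bam and the file names in the usage comment are illustrative only; nothing here is part of GATK or Picard.

```shell
# Return success if the file decompresses as gzip/BGZF and starts with the
# BAM magic bytes "BAM\1" (hex 42 41 4d 01), whatever its extension.
looks_like_bam() {
    magic=$(gzip -dc < "$1" 2>/dev/null | head -c 4 | od -An -tx1 | tr -d ' \n')
    [ "$magic" = "42414d01" ]
}

# Hypothetical usage: give a misnamed file the extension GATK insists on.
# looks_like_bam limpet.sorted && ln -s limpet.sorted limpet.sorted.bam
```

This only inspects the first few bytes, so it says nothing about whether the rest of the file is intact.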

A sequence dictionary (and index) is compulsory for use of a FASTA reference

Fasta dict file <ref>.dict for reference <ref>.fa does not exist. Please see http://gatkforums.broadinstitute.org/discussion/1601/how-can-i-prepare-a-fasta-file-to-use-as-reference for help creating it.

Unmentioned in the documentation for the RealignerTargetCreator tool I was using, a sequence dictionary for the reference FASTA must be built and present in the same directory. The error kindly refers you to a help article on how one can achieve this with Picard and indeed, the process is simple:

java -jar ~/git/picard-tools-1.138/picard.jar CreateSequenceDictionary R=<ref>.fa O=<ref>.dict

Though, I am somewhat confused as to exactly what a .dict file provides GATK over a FASTA index .fai (which is also required). Both files include the name and length of each contig in the reference, but the FASTA index also includes the positional information (byte offsets and line lengths) vital to enabling fast random access. The only additional information in the SAM-header-like sequence dictionary appears to be an MD5 hash of the sequence, which doesn’t seem overly useful in this scenario. I guess the .dict adds a layer of protection if GATK uses the hash as a sanity check, ensuring the loaded reference matches the one for which the index and dictionary were constructed.
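
To make the comparison concrete, here is a rough sketch that prints the three things a sequence dictionary records per contig: name, length and MD5. It assumes the digest is taken over the upper-cased sequence with line breaks removed, which is how the SAM specification describes the M5 tag; I have not checked that Picard computes it in exactly this way. The function name contig_summary is made up for illustration, md5sum is assumed to be available (as on most Linux systems), and the approach rereads the file once per contig, so it is only sensible for small references.

```shell
# Print name, length and MD5 for every contig in a FASTA file.
# Assumption: the digest is taken over the upper-cased sequence with line
# breaks removed, which is how the SAM specification describes the M5 tag.
contig_summary() {
    fasta=$1
    # Contig names are the first word of each header line.
    grep '^>' "$fasta" | sed 's/^>//; s/[[:space:]].*$//' | while read -r name; do
        seq=$(awk -v want="$name" '
            /^>/ { keep = (substr($1, 2) == want); next }
            keep { printf "%s", toupper($0) }
        ' "$fasta")
        digest=$(printf '%s' "$seq" | md5sum | cut -d' ' -f1)
        printf '%s\t%s\t%s\n' "$name" "${#seq}" "$digest"
    done
}

# Hypothetical usage:
# contig_summary <ref>.fa
```

The first two columns should agree with the first two columns of the .fai; only the third is something the index does not carry.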

You forgot to index your intermediate BAM

Invalid command line: Cannot process the provided BAM file(s) because they were not indexed. The GATK does offer limited processing of unindexed BAMs in --unsafe mode, but this GATK feature is currently unsupported.

Another frequently occurring issue caused by user forgetfulness. Following the best practice pipeline, one generates many “intermediate” BAMs. Each of these must be indexed for efficient use during the following step, otherwise GATK will be disappointed with your lack of attention to detail and refuse to do any work for you.

Edit (13 Nov): A helpful reddit comment from a Picard contributor recommended setting CREATE_INDEX=True when using Picard, so that an index is created automatically alongside each newly output BAM. Handy!
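
For BAMs that were written without an index, a small helper can list the culprits before GATK complains. This is a sketch under a couple of assumptions: the function name unindexed_bams is invented, and an index is considered present if either reads.bam.bai (the name samtools index writes by default) or reads.bai (the name Picard writes) sits beside the BAM.

```shell
# List BAM files in a directory that have no index beside them.
# Both naming styles are accepted: reads.bam.bai and reads.bai.
unindexed_bams() {
    for bam in "$1"/*.bam; do
        [ -e "$bam" ] || continue              # no BAMs matched the glob
        [ -e "$bam.bai" ] && continue          # reads.bam.bai
        [ -e "${bam%.bam}.bai" ] && continue   # reads.bai
        printf '%s\n' "$bam"
    done
}

# Hypothetical usage, assuming samtools is installed:
# unindexed_bams intermediate | while read -r bam; do samtools index "$bam"; done
```

It does not check whether an existing index is older than its BAM, which is a separate way to upset GATK.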

Your temporary directory is probably too small

Unable to create a temporary BAM schedule file. Please make sure Java can write to the default temp directory or use -Djava.io.tmpdir= to instruct it to use a different temp directory instead.

GATK appears to love creating hundreds of thousands of small bamschedule.* files which, according to a glance at some relevant-looking GATK source, appear to be involved in the multithreaded merging of large BAM files. These files were so numerous that they filled my limited temporary space. This was especially frustrating given the job had run for several hours, blissfully unaware that there are only so many things you can store in a shoebox. To avoid such disaster, inform Java of a more suitable location to store junk:

java -Djava.io.tmpdir=/not/a/shoebox/ -jar <jar> <tool> ...

On rare occasions, you may encounter permission errors when writing to a temporary directory. Specifying java.io.tmpdir as above will free you of these woes too.
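
Since the failure only shows up hours into a job, it may be worth checking the scratch location before starting. The following is a minimal sketch: has_free_kb is a made-up helper, and the 50 GB threshold in the usage comment is an arbitrary figure rather than anything GATK documents.

```shell
# Succeed only if the filesystem holding a directory has at least the
# requested number of kilobytes available.
has_free_kb() {
    dir=$1
    need_kb=$2
    # df -P prints one POSIX-format line per filesystem; column 4 is the
    # space available, in 1024-byte blocks because of -k.
    avail_kb=$(df -Pk "$dir" | awk 'NR == 2 { print $4 }')
    [ "$avail_kb" -ge "$need_kb" ]
}

# Hypothetical usage: demand 50 GB of scratch before starting a long job.
# scratch=/not/a/shoebox
# has_free_kb "$scratch" $((50 * 1024 * 1024)) &&
#     java -Djava.io.tmpdir="$scratch" -jar <jar> <tool> ...
```

With hundreds of thousands of tiny files, the number of free inodes (df -i on Linux) may matter as much as the number of free bytes, and this sketch does not look at that.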

You may have too many files and not enough file handles

Picard and GATK try to store some number of reads (or other plentiful metadata) in RAM during the parsing and handling of BAM files. When this limit is exceeded, reads are spilled to disk. Both Picard and GATK appear to keep file handles for these spilled reads open simultaneously, presumably for fast access. But your executing user is likely limited to carrying only so many handles before becoming over-encumbered and falling to the ground, with throwing an exception the only option left:

Exception in thread "main" htsjdk.samtools.SAMException: […].tmp not found
[…]
Caused by: java.io.FileNotFoundException: […].tmp (Too many open files)

In my case, I encountered this error when using Picard MarkDuplicates, which has a default maximum number of file handles [1]. This ceiling happened to be higher than that of the system itself. The fix in this case is trivial: use ulimit -n to determine the number of files your system will permit you to have a handle on at once and inform MarkDuplicates using the MAX_FILE_HANDLES_FOR_READ_ENDS_MAP parameter:

$ ulimit -n
1024

$ java -jar picard.jar MarkDuplicates MAX_FILE_HANDLES_FOR_READ_ENDS_MAP=1000 ...

This is somewhat counter-intuitive: the error is caused by an acute overabundance of file handles, yet my suggested fix is to permit even fewer handles? In this case at least, it appears Picard compensates by creating fewer, larger spill files. You’ll notice I didn’t use the exact value of ulimit -n in the argument; it’s likely there’ll be a few other file handles open here and there (your input, output and metrics file, at least), so with the exact value you would stumble across the same error once more.
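
The arithmetic is simple enough to script so that the value follows whatever machine the job lands on. This is only a sketch: handle_budget is a hypothetical helper, the margin of 24 is an arbitrary allowance for the input, output, metrics file and the JVM’s own files, and the fallback of 8000 simply repeats the Picard default quoted in the footnote.

```shell
# Suggest a MAX_FILE_HANDLES_FOR_READ_ENDS_MAP value: the limit on open
# files minus a margin for the input, output, metrics file and the JVM itself.
handle_budget() {
    limit=$1        # typically the output of: ulimit -n
    margin=$2       # handles to hold back; the 24 used below is arbitrary
    if [ "$limit" = "unlimited" ]; then
        echo 8000   # no ceiling to respect; repeats the default from the footnote
    else
        echo $((limit - margin))
    fi
}

# Hypothetical usage:
# java -jar picard.jar MarkDuplicates \
#     MAX_FILE_HANDLES_FOR_READ_ENDS_MAP="$(handle_budget "$(ulimit -n)" 24)" ...
```

With a limit of 1024 and a margin of 24 this gives the 1000 used above. It makes no attempt to handle a limit smaller than the margin.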

From a little searching, it appears that for the most part GATK will open as many files as it wants and, if that number is greater than ulimit -n, it will throw a tantrum. Unfortunately, there is only so much you can do about this on your own. Non-administrative users can raise their soft limit with ulimit -n, but only as far as the hard limit (ulimit -Hn); beyond that you’ll need to befriend your system administrator and kindly request that the hard limit for file handles be raised before continuing. Though, the same source does suggest that lowering the number of GATK execution threads can potentially alleviate the issue in some cases.

Your maximum Java heap is also too small

There was a failure because you did not provide enough memory to run this program.  See the -Xmx JVM argument to adjust the maximum heap size provided to Java

GATK has an eating problem: it has no self-restraint when memory is on the table. I’m not sure whether GATK was brought up with many siblings that had to fight for food, but it certainly doesn’t help that it is implemented in Java, a language not particularly known for its memory efficiency. When invoked, Java will allocate a heap on which to pile the many objects it wants to keep around, with a typical maximum size of around 1GB. It’s not enough to just tell your job scheduler that you need all of the RAM; you also need to let Java know that it is welcome to expand the heap for dumping genomes beyond the default maximum. Luckily this is quite simple:

java -Xmx<int>G -jar <jar> <tool> ...
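
The heap should usually be a little smaller than the memory requested from the scheduler, since the JVM also uses memory outside the heap. The helper below is a sketch of that subtraction; heap_flag is a made-up name and the 2 GB of headroom in the usage comment is a guess to be tuned, not a recommendation from the GATK documentation.

```shell
# Build an -Xmx flag from the memory requested from the scheduler, holding
# some back because the JVM uses memory beyond the heap itself.
heap_flag() {
    total_mb=$1      # memory granted to the job, in megabytes
    headroom_mb=$2   # held back for non-heap JVM overhead (a guess, tune it)
    echo "-Xmx$((total_mb - headroom_mb))m"
}

# Hypothetical usage for a 16 GB job:
# java "$(heap_flag 16384 2048)" -jar <jar> <tool> ...
```

Working in megabytes keeps the arithmetic in whole numbers; the m suffix is accepted by -Xmx just as G is.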

The MalformedReadFilter has a looser definition of malformed than expected

I’ve previously touched on the discovery that the GATK MalformedReadFilter is much more aggressive than its documentation lets on. The lovely GATK developers have even opened an issue about it after I reported it in their forum.


tl;dr

  • Your BAM files should end in .bam
  • Any FASTA based reference needs both an index (.fai) and dictionary (.dict)
  • Be indexing, always
  • pysam is a pretty nice package for dealing with SAM/BAM files in Python
  • Your temp dir is too small, specify -Djava.io.tmpdir=/path/to/big/disk/ to java when invoking GATK
  • Picard may generously overestimate the number of file handles available
  • GATK is a spoilt child and will have as many file handles as it wants
  • Apply more memory to GATK with java -Xmx<int>G to avoid running out of heap
  • Remember, the MalformedReadFilter is rather aggressive
  • You need a bigger computer

  1. At the time of writing, 8000. 