Aligned Annihilation II: Dumpster Diving

   Sam Nicholls    AU-PhD

Warning: This is going to get a bit technical and probably quite boring. If you are a victim of a boost::archive::archive_exception, skip to my conclusion.

If you haven’t already looked at what led me to this awful situation, check out what happened when I annihilated all my alignment data.

The Hard Way

I’ve managed to get rapsearch to generate two core dumps for inspection with gdb but I am struggling to extract any useful information beyond what I already know. The abort() and raise() can clearly be seen in the backtrace and prior to that, construction of a standard library exception. There appears to be a brief scene missing between rapsearch calling out to boost::archive::basic_binary_iarchive and the error being constructed, leaving us with those ?? frames.

Desperately attempting to avoid having to edit and recompile rapsearch, I began nosing around the core in much the same way as one would poke a stick around in a dirty pond. At first I naively tried to explore frame 3, treating 0x31926bcbd6 as “the exception” before realising the address was for a function. If we translate (“unmangle”[1]) the symbol we can guess it is responsible for handling assignment of an exception_ptr:

I’d have to try harder. I started looking at the registers for the same frame: as the function accepts an exception pointer as a parameter, it should be stored in a register here. It was when I started reading about x86 Calling Conventions that I realised I was probably in over my head. But, a victim of the sunk cost fallacy by now, I had to continue. I disassembled the frame:

It is my understanding that the %rdi register contains the first parameter given to a function and so here I’d expect to see some form of pointer address. I’ll dump the values held in the registers at this frame too.
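For reference, on x86-64 Linux (the System V AMD64 calling convention) the first six integer or pointer arguments travel in a fixed sequence of registers, and for a C++ member function or constructor the hidden this pointer occupies the first slot. The helper below is a rough sketch of that rule only: register_for_param is a name made up for illustration, and it ignores floating-point arguments and anything passed by value in memory.

```python
# Integer/pointer argument registers, in order, for the System V AMD64 ABI
# (the convention used by Linux on x86-64).
SYSV_INT_ARG_REGS = ["rdi", "rsi", "rdx", "rcx", "r8", "r9"]


def register_for_param(param_index, is_member_function=False):
    """Return the register holding the Nth declared parameter (0-based).

    Hypothetical helper for illustration only: it assumes every argument is
    an integer or pointer, and that a member function's implicit `this`
    pointer takes the first slot.
    """
    slot = param_index + (1 if is_member_function else 0)
    if slot >= len(SYSV_INT_ARG_REGS):
        return "stack"  # later arguments are passed on the stack
    return SYSV_INT_ARG_REGS[slot]


if __name__ == "__main__":
    # A free function's first parameter arrives in %rdi...
    print(register_for_param(0))
    # ...but a constructor's first declared parameter is bumped to %rsi,
    # because %rdi carries `this`.
    print(register_for_param(0, is_member_function=True))
```

So when the frame belongs to a constructor, %rdi would be expected to hold the address of the object being built rather than the first argument written in the signature.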

Unfortunately 0xb96 doesn’t appear to point to anything. I wondered whether it would be worth trying to manually work through the assembly instructions instead. After determining which way around to even read the instructions[2], I could become my own arch-nemesis, a computer:

Beep boop. Ok, so that’s a start. _ZNSt15__exception_ptr13exception_ptrC1ERKS0_ unmangles[3] to:

A constructor! Expecting a reference to an exception_ptr as its parameter. So what’s left in %rdi after the subtraction? It looks address-worthy, let’s examine it:

Seems promising? We’re hunting for information on an archive_exception! _ZTIN5boost7archive17archive_exceptionE unmangles to:

Just typeinfo, not an actual archive_exception instance. Weird. What about the hex? Is it an address? Where does it go?

It is, and the symbol unmangles to:

Hm. Too far. I’m not interested in the contents of a vtable. The archive_exception class has virtual functions, so this will be where its function pointers are stored. We want the specific instance of the class that is raised; we want that error code.

Incidentally, eagle-eyed viewers will note the result of my computation was already stored in %rbp. Oddly, I thought this was where the frame’s base pointer should be, and it is unclear to me why it instead points to a typeinfo object. But as disclaimed, I have no idea what I’m doing here.

Lunch slipped by me as I tried endless combinations of address lookups, quickly getting lost in the 37GB core file. Let’s look back up the stack.

Frame 6 holds the actual call that leaves rapsearch in an error state. Disassembly clocks in at about 250 lines of instructions so I’ll cut out some potentially interesting lines instead:

_ZN5boost7archive17archive_exceptionC1ENS1_14exception_codeEPKcS4_ catches my eye, and rightly so; it demangles to:

Bingo. We’ve got where the archive_exception is constructed! The first parameter is the exception_code. It appears the contents of %r13 are moved to %rdi just prior to the call.
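The demangling done throughout this section can also be scripted. The sketch below assumes c++filt (from GNU binutils) is installed and on the PATH, which is not necessarily the demangling tool used in this post; demangle is a name made up for the example, and it returns None rather than guessing when the tool is missing.

```python
import shutil
import subprocess


def demangle(symbol):
    """Demangle an Itanium C++ ABI symbol by shelling out to c++filt.

    Assumes c++filt is on the PATH; returns None if it is not.
    """
    tool = shutil.which("c++filt")
    if tool is None:
        return None
    result = subprocess.run(
        [tool, symbol], capture_output=True, text=True, check=True
    )
    return result.stdout.strip()


if __name__ == "__main__":
    # The three mangled symbols that appear in this post.
    for mangled in (
        "_ZNSt15__exception_ptr13exception_ptrC1ERKS0_",
        "_ZTIN5boost7archive17archive_exceptionE",
        "_ZN5boost7archive17archive_exceptionC1ENS1_14exception_codeEPKcS4_",
    ):
        print(mangled, "->", demangle(mangled))
```

With a GNU toolchain, the last symbol should come back as boost::archive::archive_exception::archive_exception(boost::archive::archive_exception::exception_code, char const*, char const*).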

Yet inspecting the various registers, I still can’t get out the exception code. Dan then asked me if I was sure the symbols were definitely correct. Recalling the recent recompilation debacle, which was complete with its own boost library mis-version misadventure and the rapsearch repository containing a two-year-old version of boost, no, I couldn’t be sure at all.

I recompiled rapsearch and tried to run gdb on the same core. The symbols were different. Shortly after, the work day drew to a close[4] and, sick of feeling like I’d been manually searching through a cow’s core dump, I gave up.

it’s ok at least we had fun right

The Easy Way

There must be an easier way, I thought, just before bed. The prelude to this epic introduced the error at hand:

That post also linked to a boost serialization archive exceptions manual entry that Dan kindly located. Listed within are various types of errors that can be raised, mapped by an enum:

Wait. what()? That looks familiar?

Oh dear[5]. I’d disregarded the what() error as it sounded confusing and mysterious and not related to file handling. Yet it was trying to tell us the answer all along:

invalid_signature
Archives are initiated with a known string. If this string is not found when the archive is opened, It is presumed that this file is not a valid archive and this exception is thrown.
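The check described in that manual entry can be pictured with a toy model. This is not Boost’s actual archive format: the SIGNATURE value, the InvalidSignature class and both functions are stand-ins invented for this sketch, which only shows the idea of refusing to read a stream that does not begin with the expected string.

```python
import io

# Stand-in for the known string an archive starts with; the real value and
# encoding used by Boost are not reproduced here.
SIGNATURE = b"toy::archive"


class InvalidSignature(Exception):
    """Toy counterpart of archive_exception's invalid_signature."""


def write_archive(stream, payload):
    """Write the signature first, then the payload."""
    stream.write(SIGNATURE)
    stream.write(payload)


def read_archive(stream):
    """Return the payload, refusing any stream that lacks the signature."""
    header = stream.read(len(SIGNATURE))
    if header != SIGNATURE:
        raise InvalidSignature("invalid signature")
    return stream.read()


if __name__ == "__main__":
    good = io.BytesIO()
    write_archive(good, b"alignment hits")
    good.seek(0)
    print(read_archive(good))

    # Something else has scribbled over the start of the file.
    clobbered = io.BytesIO(b"not an archive at all")
    try:
        read_archive(clobbered)
    except InvalidSignature as err:
        print("caught:", err)
```

The point of the model is that the error says nothing about the payload; it fires as soon as the first few bytes are not what the reader expects.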

So, what’s the verdict?

I had a hunch that this might have been caused by corruption of the temporary files that rapsearch creates in the working directory. For each job, rapsearch creates temp files in the format <outname>.tmp<N>, where outname is the basename of the output file and N is the index of the temporary file, starting at 0.

It’s not uncommon to execute multiple jobs that share an output directory. Here, I was trying to keep my data organised by storing the alignment hits for bacterial, archaeal and fungal associated hydrolases on my limpet contigs in the same place.

However, rapsearch creates a .m8 file storing hits along with a somewhat esoteric .aln alignment file, and one cannot prevent the latter from being generated (Update: Or so I thought at the time, see below). Rather than just deleting any .aln files once the job had completed, I thought I’d try to be clever, and found that rapsearch accepts a -u option:

Great. I’ll specify -u 1 for .m8 output only and redirect stdout to my $OUTFILE. Job done. Except not. When using this stream option, rapsearch isn’t writing to any files and so has no value to prepend to the .tmp temp suffix. So what happens?

Every job ends up sharing a .tmp0 file. Which, as you can imagine, goes down pretty well. As a job progresses, rapsearch heads off to disk to use its trusty temp file, only to discover the archive header has been tampered with, which is upsetting enough to throw an error. Cue the invalid_signature error, on stage. Mystery solved.
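The collision is easy to see once the naming scheme (output basename, then .tmp, then an index) is written down. The function below is a hypothetical reconstruction for illustration, not code taken from rapsearch, and the job names are made up.

```python
def temp_name(outname, index):
    """Build a temp file name as <outname>.tmp<N>.

    Hypothetical reconstruction of the naming scheme, not rapsearch code.
    """
    return f"{outname}.tmp{index}"


if __name__ == "__main__":
    # With an output file given, each job gets its own temp file.
    named = {temp_name(job, 0) for job in ("bacteria", "archaea", "fungi")}
    print(sorted(named))

    # Streaming to stdout leaves the output name empty, so every job
    # lands on the same ".tmp0".
    streamed = {temp_name("", 0) for _ in range(3)}
    print(sorted(streamed))
```

Three named jobs give three distinct files; three streamed jobs give one.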

Blackout. Drop curtain.

Update: A few hours later

All six jobs, running in harmony:

Note those all important unique temporary file names!

Update: Bedtime

In case this ever affects anybody else, I’ve notified the developers by opening an issue on the rapsearch GitHub. Hooray for open sorcery!

Update: A few days and another manpage later

Turns out, one can suppress the .aln file after all. As demonstrated in the rapsearch usage examples, the -b option (help entry below) can be set to 0.

This seems somewhat counter-intuitive to me and is simply not something I had thought of trying. If anything I’d have expected -b 0 to just create an empty .aln file! Nevertheless, this is a blog about metagenomics, not user interface design, so let’s get on with some science.


tl;dr

  • Life feels different now, I took it too far. I learned more about registers, calling conventions and assembly and have seen too much. I would very much like to never do this again.[6]
  • If you don’t use the -o option, rapsearch jobs running in the same working directory probably all share a single .tmp0 temporary file, corrupting it and causing the jobs to fail.
  • I am not a very good computer.
  • Read errors. Believe errors.

  1. Thanks to Dan for showing me this symbol demangling tool, as well as for putting up with hours of remote interrogation. You may increment your drinks counter. 
  2. Made more difficult as combinations of assembly instruction and which way around yielded many search results for construction of IKEA flatpack furniture. 
  3. Thanks to Dan for showing me this symbol demangling tool, as well as for putting up with hours of remote interrogation. You may increment your drinks counter. 
  4. Two hours ago. 
  5. An understatement. 
  6. I can’t justify complaining about staring into the abyss of disassembly because my lack of attention brought on my fate.