The kernels on our cluster clients have recently been updated after I inadvertently stumbled across an old¹ kernel bug that caused erratic behaviour when NFS tried to open a directory containing many files that were being written to simultaneously (more on which is another post in itself, really).
The update seems to have caused my rapsearch (a BLAST alternative I’m trying out) job scripts to exhibit strange behaviour this morning: terminating with SIGABRT, dumping core and outputting the following to stderr:
```
terminate called after throwing an instance of 'boost::archive::archive_exception'
  what():  invalid signature
/cm/local/apps/sge/current/spool/node001/job_scripts/1439428: line 282:  4846 Aborted (core dumped) rapsearch -q $QUERY -d /ibers/ernie/groups/rumenISPG/Databases/RAPsearch_bacteria_1 -u 1 -z 5 -e 0.00001 > $OUTFILE
```
Our sysadmin suspects that SGE (Sun Grid Engine, the job scheduler) caches its own copy of associated libraries and binaries to support checkpoints, which may have resulted in a version mismatch for the boost library on the cluster clients following their kernel update.
Killing the jobs that had been submitted prior to the updates and resubmitting them seemed to fix the issue, until I noticed that my output directory contained hundreds of 5.9GB core dumps that had filled the remaining 2TB of our cluster’s scratch disk, causing I/O errors all round. Oops. Sorry everyone. Deleted, killed, resubmitted.
Regardless, I was confused to see that SGE reported all the terminated jobs as completing successfully. This is somewhat annoying, as I filter tasks (of which there can be thousands) by their exit status, using awk and such to extract the task IDs of failed jobs for resubmission.
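For the curious, this is roughly the sort of thing I mean; a minimal sketch, assuming SGE’s qacct accounting output is available, and using the job ID from the error above purely as an illustration:

```bash
# Print the task IDs of an array job whose tasks finished with a non-zero exit status
# (illustrative job ID; assumes the job has finished and accounting records exist)
qacct -j 1439428 | awk '
    /^taskid/      { task = $2 }
    /^exit_status/ { if ($2 != 0) print task }
'
```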
I’m aware commands can return a non-zero exit status in a non-error scenario; for example, grep returns 1 if the search string is not found in the target, and you wouldn’t necessarily want your script to terminate under those circumstances. But I had assumed that when boost detected an internal error and called abort(), the error would be fatal and kill the script with a code of 128+6, yet this is clearly not the case! Instead the job script continues running, executing the housekeeping commands that follow (such as moving the $OUT.rap6.wip output file to $OUT.rap6), so the exit status of the script is simply the status of the final (successful) mv command!
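To convince myself, here’s a minimal stand-alone sketch of that behaviour (the file names are made up, and the kill stands in for the aborting rapsearch call):

```bash
#!/bin/bash
# Without `set -e`, a command killed by SIGABRT (exit status 128+6 = 134) does not
# stop the script; the script's exit status is whatever the last command returns.
bash -c 'kill -ABRT $$'                    # stands in for the aborting rapsearch run
echo "carried on; previous status was $?"  # prints 134
touch dummy.rap6.wip
mv dummy.rap6.wip dummy.rap6               # stands in for the housekeeping mv
# the script exits 0 here, because the final mv succeeded
```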
But what if I really do want a script to terminate for any non-zero return code? A StackOverflow answer suggested adding the following to the top of the script to cause bash to quit in these scenarios:
```bash
set -e
```
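With this in place, bash bails out as soon as any simple command returns a non-zero status (the usual exceptions apply: commands tested in conditionals, commands in && / || lists other than the last, and all but the last command of a pipeline).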
Testing this out, I submitted a stub job, only to find it terminated with a non-zero exit code almost immediately after beginning, with no information on stdout or stderr. I sprinkled the script with echo statements and discovered the problem:
```bash
CURR_i=$(expr $SGE_TASK_ID - 1)
```
This is one of the “housekeeping” lines that I automatically include in the header of my job scripts; it gives the script a zero-indexed task ID which is used as an array index to get the i’th filepath to use as input to the current job. Turns out that expr has an exit status of 1 if it evaluates to 0, which it would for the 1st job in a job array, as 1 - 1 = 0². This wouldn’t have been a surprise if I had read the manual:
Exit status is […] 1 if EXPRESSION is null or 0
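It’s easy enough to see at the prompt (the comments show what I’d expect from GNU expr):

```bash
expr 2 - 1 ; echo "exit status: $?"   # prints 1, then "exit status: 0"
expr 1 - 1 ; echo "exit status: $?"   # prints 0, then "exit status: 1" -- fatal under set -e
```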
The top answer to a StackOverflow question on the topic warns that using expr for arithmetic in this manner has long been obsolete and that I should be using bash’s $((...)) construct, which won’t produce a non-zero exit status in this case (the assignment itself succeeds, regardless of the arithmetic result). Taking this on board, I updated the line:
```bash
CURR_i=$(($SGE_TASK_ID - 1))
```
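As a quick sanity check (SGE_TASK_ID is normally set by SGE for each task of a job array; here it is faked by hand):

```bash
SGE_TASK_ID=1                 # faked; SGE sets this for each array task
CURR_i=$(($SGE_TASK_ID - 1))
echo "CURR_i=$CURR_i, exit status: $?"   # CURR_i=0, exit status: 0 -- set -e stays quiet
```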
Stub script works fine (no early termination). Returning to my rapsearch job, it now seems to be throwing a different error that still causes a SIGABRT and core dump (or it would if I had not disabled it³) due to an instance of std::bad_alloc being thrown. Investigation of this will have to wait until tomorrow; at least now the script stops immediately and returns a non-zero status which can be tracked by my tools.
At the very least we can keep a grasp on what is failing now. Hooray.
tl;dr
- Disabling core dumps is probably a good idea, especially if you are running a job thousands of times
- Forcing bash scripts to fail on any non-zero status is also probably a good idea
- expr returns a non-zero status for expressions that evaluate to 0
- Kernel bugs are the worst
- As in an old kernel; at the time we were using 2.6.32: `Linux bert 2.6.32-220.7.1.el6.x86_64 #1 SMP Tue Mar 6 15:45:33 CST 2012 x86_64 x86_64 x86_64 GNU/Linux`. The bug was fixed in an update but currently we’re unsure how “far” we can update the kernels on the clients without having to recompile everything and thus invalidate the environment all our experiments have been running on since the cluster went up… ↩
- I have a maths degree, so you can probably trust me on this. ↩
- For future reference, I added the following housekeeping line to disable core dumps:
```bash
ulimit -c 0
```
↩