The kernels on our cluster clients were recently updated after I inadvertently stumbled across an old[1] kernel bug that caused erratic behaviour when NFS tried to open a directory containing many files that were being written to simultaneously (which deserves a post of its own, as usual).
The update seems to have caused my rapsearch (a BLAST alternative I’m trying out) job scripts to exhibit strange behaviour this morning: terminating with SIGABRT, dumping core and outputting the following to stderr:
terminate called after throwing an instance of 'boost::archive::archive_exception'
what(): invalid signature
/cm/local/apps/sge/current/spool/node001/job_scripts/1439428: line 282: 4846 Aborted (core dumped) rapsearch -q $QUERY -d /ibers/ernie/groups/rumenISPG/Databases/RAPsearch_bacteria_1 -u 1 -z 5 -e 0.00001 > $OUTFILE
Our sysadmin suspects that SGE (Sun Grid Engine, the job scheduler) caches its own copy of associated libraries and binaries to support checkpoints, which may have resulted in a version mismatch for the boost library on the cluster clients following their kernel update.
Killing the jobs that had been submitted prior to the updates and resubmitting them seemed to fix the issue, until I noticed that my output directory contained hundreds of 5.9GB core dumps, which had filled the remaining 2TB of our cluster’s scratch disk and caused I/O errors all round. Oops. Sorry everyone. Deleted, killed, resubmitted.
Regardless, I was confused to see that SGE reported all the terminated jobs as completing successfully. This is somewhat annoying as I filter tasks (of which there can be thousands) by their exit status, using awk and such to extract the task IDs of failed jobs for resubmission.
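A minimal sketch of that kind of filter follows. It assumes the accounting records arrive on stdin as "key value" lines with taskid and exit_status keys (the layout qacct -j prints); failed_tasks is a hypothetical name for illustration, not the actual tooling.

```shell
# Hypothetical helper: print the task ID of every record whose exit_status is non-zero.
# Assumes "key value" lines on stdin, with taskid appearing before exit_status
# within each record (an assumption about the accounting output layout).
failed_tasks() {
    awk '$1 == "taskid"                 { task = $2 }
         $1 == "exit_status" && $2 != 0 { print task }'
}

# Usage (JOB_ID is a placeholder): qacct -j "$JOB_ID" | failed_tasks
```

A filter like this is only as good as the exit status each task reports, which is why a job that aborts but still exits 0 slips straight through.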
I’m aware commands can return a non-zero exit status in a non-error scenario; for example, grep returns 1 if the search string is not found in the target, and you wouldn’t necessarily want your script to terminate under those circumstances. But I had assumed that when boost threw an exception that went uncaught and abort() was called, the error would be fatal and kill the script with a code of 128+6 (134), but this is clearly not the case! Instead the job script continues running, executing the housekeeping commands that follow (such as moving the $OUT.rap6.wip output file to $OUT.rap6), so the exit status of the script is the status of the final (successful) command.
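The behaviour can be reproduced without rapsearch. The sketch below uses a child bash that sends itself SIGABRT as a stand-in for the aborting program, and true as a stand-in for the housekeeping; both are substitutes chosen for illustration, not lines from the real job script.

```shell
# Stand-in for the aborting command: a child bash that sends itself SIGABRT.
# The subshell keeps the core-dump limit change local to this demonstration.
aborted_status=0
( ulimit -c 0; bash -c 'kill -ABRT $$' ) || aborted_status=$?

# Stand-in for the housekeeping that follows (e.g. moving the output file).
true
final_status=$?

echo "aborted command: $aborted_status, last command: $final_status"
```

On Linux, where SIGABRT is signal 6, the aborted command reports 128+6 = 134, but the last command reports 0, and a script ending there would hand that 0 back to the scheduler.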
But what if I really do want a script to terminate for any non-zero return code? A StackOverflow answer suggested adding the following to the top of the script to cause bash to quit in these scenarios:
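The usual bash mechanism for this is the errexit option, set -e. The sketch below assumes that is the setting in question and runs a tiny script in a child bash to show the effect; false and the echo are stand-ins for a failing command and the housekeeping after it.

```shell
# Run a tiny script in a child bash with errexit (set -e) enabled.
status=0
output=$(bash -c '
set -e
false                  # stand-in for any command that fails
echo "housekeeping"    # never runs: bash exits at the failing command
') || status=$?

echo "output='$output' status=$status"
```

The child script stops at the failing command and its exit status becomes that command's status, so nothing after it can mask the failure.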
Testing this out, I submitted a stub job, only to find it terminated with a non-zero exit code almost immediately after starting, with no information on stderr. I sprinkled the script with echo statements and discovered the problem:
CURR_i=$(expr $SGE_TASK_ID - 1)
This is one of the “housekeeping” lines that I automatically include in the header of my job scripts; it gives the script a 0-indexed task ID which is used as an array index to get the i’th filepath to use as input to the current job. Turns out that expr has an exit status of 1 if the expression evaluates to 0, which it would for the 1st job in a job array as 1 - 1 = 0[2]. This wouldn’t have been a surprise if I had read the manual:
Exit status is […] 1 if EXPRESSION is null or 0
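This is easy to check outside of SGE. In the sketch below SGE_TASK_ID is set by hand to mimic the first task of an array job, and the status is captured with || so the check itself does not trip any exit-on-error setting.

```shell
SGE_TASK_ID=1    # SGE sets this for array tasks; fixed here for illustration

status=0
CURR_i=$(expr "$SGE_TASK_ID" - 1) || status=$?

echo "CURR_i=$CURR_i expr_status=$status"
```

The value comes out as 0, as intended, but expr reports an exit status of 1 alongside it, which is all an exit-on-error script needs to bail out.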
The top answer to a StackOverflow question on the topic warns that using expr for arithmetic in this manner has long been obsolete and that I should be using the $((...)) construct, which won’t return a non-zero status in this case. Taking this on board, I updated the line:
CURR_i=$(($SGE_TASK_ID - 1))
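A quick check of the arithmetic expansion under exit-on-error, again as a sketch: it assumes set -e is the option in use and passes SGE_TASK_ID=1 by hand to a child bash to mimic the first array task.

```shell
# Same calculation for task 1, run in a child bash with errexit enabled.
status=0
output=$(SGE_TASK_ID=1 bash -c '
set -e
CURR_i=$((SGE_TASK_ID - 1))    # the $ before the name is optional inside $(( ))
echo "CURR_i=$CURR_i"
') || status=$?

echo "$output (status $status)"
```

The assignment succeeds even though the result is 0, so the script carries on to the echo and exits cleanly.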
The stub script works fine (no early termination). Returning to my rapsearch job, it now seems to be throwing a different error that still causes a SIGABRT and core dump (or it would if I had not disabled them[3]) due to an instance of std::bad_alloc being thrown. Investigation of this will have to wait until tomorrow; at least the script now stops immediately and returns a non-zero status which can be tracked by my tools.
At the very least we can keep a grasp on what is failing now. Hooray.
- Disabling core dumps is probably a good idea, especially if you are running a job thousands of times
- Forcing bash scripts to fail on any non-zero status is also probably a good idea
- expr returns a non-zero status for expressions that evaluate to 0
- Kernel bugs are the worst
1. As in an old kernel; at the time we were using 2.6.32:
   Linux bert 2.6.32-220.7.1.el6.x86_64 #1 SMP Tue Mar 6 15:45:33 CST 2012 x86_64 x86_64 x86_64 GNU/Linux
   The bug was fixed in an update but currently we’re unsure how “far” we can update the kernels on the clients without having to recompile everything and thus invalidate the environment all our experiments have been running on since the cluster went up.
2. I have a maths degree, so you can probably trust me on this.
3. For future reference, I added the following housekeeping line to disable core dumps:
   ulimit -c 0
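A small sketch for checking that the limit has taken effect. ulimit is a shell builtin and the limit is inherited by child processes, so it needs to run in the job script before the command whose core dumps should be suppressed; the subshell here only keeps the demonstration from changing the calling shell.

```shell
# Set the core-file size limit to 0 in a subshell, then read it back.
core_limit=$( ulimit -c 0; ulimit -c )
echo "core file size limit: $core_limit"
```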