`memblame`

Sam Nicholls    System Administration, Tools

As a curious and nosy individual who likes to know everything, I wrote a script dubbed memblame, which is responsible for naming and shaming the authors of “inefficient”1 jobs on our cluster here at IBERS.

It takes patience, often days and sometimes longer, to see large-input jobs executed on a node of the compute cluster here. Typically this is down to the amount of RAM requested: only a handful of nodes are actually capable of scheduling jobs that have a RAM quota of 250GB or larger, and those nodes are often busy with other tasks too.

One dreary afternoon while waiting a particularly long time for an assembly to pop off the queue and begin, I started to wonder what the hold up was.

Our cluster is underpinned by Sun Grid Engine (SGE), a piece of software entrusted with the scheduling and management of submitted jobs, and one about which I have formed a strong opinion over the past few months2. When a job completes (regardless of exit status), SGE stores the associated job metadata in plain text in an “accounting” logfile on the cluster’s root node.

The file appeared trivially parseable3 and offered numerous fields for every job submitted since the root node’s last boot4. Primed for procrastination with mischief and curiosity, I knocked up a Python-based parser and delivered memblame.
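For the curious, the gist of the parsing looks something like the minimal sketch below. The field positions and the h_vmem pattern are assumptions drawn from the accounting(5) manual and may differ between Grid Engine versions; this is a toy illustration rather than memblame itself.

```python
import re

# 0-based indices into a colon-delimited SGE accounting(5) record.
# NOTE: these positions (and the h_vmem pattern below) are assumptions
# based on the accounting(5) manual; other Grid Engine builds may differ.
F_HOSTNAME    = 1   # execution node
F_OWNER       = 3   # username of the job's author
F_JOB_NAME    = 4   # name of the job script
F_JOB_NUMBER  = 5   # SGE job ID
F_START_TIME  = 9   # epoch seconds
F_END_TIME    = 10  # epoch seconds
F_EXIT_STATUS = 12
F_CATEGORY    = 39  # submission options, e.g. "... -l h_vmem=250G ..."
F_MAXVMEM     = 42  # peak virtual memory, assumed to be reported in bytes

GB = 1024 ** 3
UNIT = {"K": 1024, "M": 1024 ** 2, "G": GB}

def parse_accounting_line(line):
    """Extract the memblame fields from one accounting record.

    Naively splits on ':', so it assumes no field contains a colon.
    """
    f = line.rstrip("\n").split(":")

    # The requested RAM hides inside the category string as an h_vmem option.
    req = re.search(r"h_vmem=([\d.]+)([KMG])", f[F_CATEGORY])
    gbmem_req = float(req.group(1)) * UNIT[req.group(2)] / GB if req else 0.0
    gbmem_used = float(f[F_MAXVMEM]) / GB

    return {
        "jid": f[F_JOB_NUMBER],
        "node": f[F_HOSTNAME],
        "name": f[F_JOB_NAME],
        "user": f[F_OWNER],
        "gbmem_req": gbmem_req,
        "gbmem_used": gbmem_used,
        "delta_gbmem": gbmem_req - gbmem_used,
        "pct_mem": 100.0 * gbmem_used / gbmem_req if gbmem_req else 0.0,
        "hours": (int(f[F_END_TIME]) - int(f[F_START_TIME])) / 3600.0,
        "exit": int(f[F_EXIT_STATUS]),
    }
```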

The script dumps out a table detailing each job with the following fields as columns:

Field                   Description
jid                     SGE Job ID
node                    Hostname of Execution Node
name                    Name of Job Script
user                    Username of Author
gbmem_req               GB RAM Requested
gbmem_used              GB RAM Used
delta_gbmem             ΔGB RAM (Requested − Used)
pct_mem                 % of Requested RAM Utilised
time                    Execution Duration
gigaram_hours           GB RAM Used × Execution Hours
wasted_gigaram_hours    GB RAM Unused × Execution Hours
exit                    Exit Status (0 on success)

The table introduces the concept of wasted_gigaram_hours: the number of gigabytes of RAM unused, multiplied by the number of hours the job ran for. Here RAM “used” is defined as the peak RAM usage measured by the scheduler over the duration of the job5, so RAM unused is simply the difference between RAM requested and RAM utilised (delta_gbmem). Thus a job that over-requests 1GB of RAM and runs for a day “wastes” 24 GB Hours!
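In code terms the two GB Hour columns are simple products over those parsed records; a minimal sketch, reusing the record layout from the parser sketch above:

```python
def gigaram_hours(job):
    """GB RAM used x execution hours."""
    return job["gbmem_used"] * job["hours"]

def wasted_gigaram_hours(job):
    """GB RAM unused (requested minus peak used) x execution hours.

    Clamped at zero for jobs that exceeded their request.
    """
    return max(job["delta_gbmem"], 0.0) * job["hours"]

# Over-request 1GB on a job that runs for a day: 1 x 24 = 24 wasted GB Hours.
assert wasted_gigaram_hours({"delta_gbmem": 1.0, "hours": 24.0}) == 24.0
```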

I created this additional field in an attempt to more fairly compare classes of job that take vastly different times to complete: jobs that use (and over-request) large amounts of RAM for a short time should not necessarily be shamed more than smaller jobs that over-request less RAM over a much longer period.

Incidentally, at the time of publishing the 1st Monthly MemBlame Leaderboard, no matter which field was used to order the rankings, a member of our team who shall remain nameless6 won the gold medal for wastage.

It wasn’t necessarily the top of the list that was interesting, though. Naming and shaming those responsible for ridiculous RAM wastage (~0.76 TB per day over 11 days6) on an assembly job that didn’t even complete successfully6 is fun in jest, but memblame also revealed user behaviours such as a tendency to request the default amount of RAM for small jobs such as BLASTing (up to ~5x more RAM than necessary), which easily tied up resources on smaller nodes when many of these jobs ran in parallel. In the long run I’d like to use this sort of data to improve guess-timates of resource requests for large, long-running jobs, in an attempt to reduce resource hogging for significant periods when completing big assemblies and alignments.

I should add that “wasted RAM” is just one of many dimensions we could look at when discussing job “efficiency”7. I chose to look at RAM underuse for this particular situation as, in my opinion, it appears to be the scarcest resource in our setup and the one whose usage users seem to struggle the most to estimate.

If nothing else, it promotes a healthy discussion about the efficiency of the tools we are using, and the opportunity to poke some light-hearted fun at people who lock up 375GB of RAM for two hours while running a poorly parameterised sort8.


tl;dr

  • I wrote a script to name and shame people who asked for more RAM than they needed.

  1. Although properly determining a metric to fairly represent efficiency is a task in itself. 
  2. I’m also writing software with the sole purpose of abstracting away having to deal with SGE entirely. 
  3. In fact the hardest part was digging around to locate a manual to actually decipher what each field represented and how to translate them to something human readable. 
  4. Which seems to be correlated with the date of Aberystwyth’s last storm. 
  5. It’s likely that jobs are even less “efficient” than reported by memblame, as scripts probably don’t utilise their peak memory uniformly over a job’s lifetime. Unfortunately max_vmem is the only metric for RAM utilisation that can be extracted from SGE’s accounting file. 
  6. I’m sorry, Tom. 
  7. Although properly determining a metric to fairly represent efficiency is a task in itself. 
  8. That one was me.