System Administration – Samposium

Using fail2ban to mitigate simple DOS attacks against apache (or why I am a terrible sysop)

Sam — Mon, 06 Jun 2016 21:49:49 +0000

If you aren’t interested in the dramatic reconstruction of the events that led me to try and configure a simple DOS jail with fail2ban, you can skip to the fail2ban apache DOS jail mitigation.
If you just want to know what happened without the flair, skip to the timeline.

Earlier this afternoon, my server was upset. At 15:57, a duo of IP addresses begun making rapid and repeated POST requests to an auxiliary component of WordPress, forcing apache to begin consuming significant amounts of system memory. Disappointingly this went undetected, and less than half an hour later, at 16:24, the system ran out of memory, invoked the OOM killer and terminated mysqld. Thus at 16:24, denial of service to all applications requiring access to a database was successful.

Although the server dutifully restarted mysqld less than a minute later, the attack continued. Access to apache was denied intermittently (by virtue of the number of requests) and the OOM killer terminated mysqld again at 16:35. The database server daemon was respawned once more, only to be killed just short of half an hour later at 17:03.

It wasn’t until 17:13 that I was notified of an issue, by means of a Linode anomaly notification, disk I/O had been unusually high for a two hour period. I was away from my terminal but used my phone to check my netdata instance. Indeed I could confirm a spike in disk activity but it appeared to have subsided. I had run some scripts and updates (which can occasionally trigger these notifications) in the previous two hours so assumed causation and dismissed the notification. Retrospectively, it would be a good idea to have some sort of check list to run through upon receipt of such a message, even if the cause seems obvious.

The attack continued for the next hour and a half, maintaining denial of the mysqld service (despite the respawner’s best effort), at 18:35 (two and a half hours after the attack began) I returned from the field to my terminal and decided to double check the origin of the high disk I/O. I loaded the netdata visualiser (apache seemed to be responsive) and load seemed a little higher than usual. Disk I/O was actually higher than usual, too. It would seem that I had become a victim of y-axis scaling; the spike I had dismissed as a one-off burst in activity earlier had masked the increase in average disk I/O. Something was happening.

I checked system memory, we were bursting at the seams. The apache process was battling to consume as much memory on the system as possible. mysqld appeared to be in a state of flux, so I tried to reach database backed applications; Phabricator, and my blog – both returned some form of upset “where is my database” response. I opened the syslog and searched for evidence that the out of memory killer had been swinging its hammer. At this point I realised this was a denial of service.

I located the source of the high disk I/O when I opened the apache access log. My terminal spewed information on POST requests to xmlrpc.php aimed at two WordPress sites hosted on my server. I immediately added iptables rules for both IP addresses, and two different IPs from the same block took over the attack. I checked the whois and discovered all the origin IPs were in the same assigned /24 block, so I updated iptables with a rule to drop traffic from the whole block. The requests stopped and I restarted the seemingly mangled mysqld process.

I suspect the attack was not aimed at us particularly, but rather the result of a scan for WordPress sites (I am leaning towards for the purpose of spamming). However I was disappointed in my opsec-fu, not only did I prevent this from happening, but I failed to stop it happening for over two hours. I was running OSSEC, but any useful notifications failed to arrive in time as I had configured messages to be sent to a non-primary address that GMail must poll from intermittently. A level 12 notification was sent 28 minutes after the attack started as soon as the OOM was invoked for the first time, but the message was not pulled to my inbox until after the attack had been stopped.

The level of traffic was certainly abnormal and I was also frustrated that I had not considered configuring fail2ban or iptables to try and catch these sort of extreme cases. Admittedly, I had dabbled in this previously, but struggled to strike a balance with iptables that did not accidentally ban false positives attempting to use a client’s web application. Wanting to combat this happening in future, I set about to implement some mitigations:

Mitigation Implementation

Configure a crude fail2ban jail for apache DOS defence

My first instinct was to prevent ridiculous numbers of requests to apache from the same IP being permitted in future. Naturally I wanted to tie this into fail2ban, the daemon I use to block access to ssh, the mail servers, WordPress administration, and such. I found a widely distributed jail configuration for this purpose online but it did not work; it didn’t find any hosts to block. The hint is in the following error from fail2ban.log when reloading the service:

fail2ban.jail   : INFO   Creating new jail 'http-get-dos'
...
fail2ban.filter : ERROR  No 'host' group in '^ -.*GET'

The regular expression provided by the filter (failregex) didn’t have a ‘host’ group to collect the source IP with, so although fail2ban was capable of processing the apache access.log for lines containing GET requests, all the events were discarded. This is somewhat unfortunate considering the prevalence of the script (perhaps it was not intended for the combined_vhost formatted log, I don’t know). I cheated and added a CustomLog to my apache configuration to make parsing simple whilst also avoiding interference with the LogFormat of the prime access.log (whose format is probably expected to be the default by other tooling):

LogFormat "%t [%v:%p] [client %h] \"%r\" %>s %b \"%{User-Agent}i\"" custom_vhost
CustomLog ${APACHE_LOG_DIR}/custom_access.log custom_vhost

The LogFormat for the CustomLog above encapsulates the source IP in the same manner as the default apache error.log, with square brackets and the word “client”. I updated my http-get-dos.conf file to provide a host group to capture IPs as below (I’ve provided the relevant lines from jail.local for completeness):

I tested the configuration with fail2ban-regex to confirm that IP addresses were now successfully captured:

$ fail2ban-regex /var/log/apache2/custom_access.log /etc/fail2ban/filter.d/http-get-dos.conf
[...]
Failregex
|- Regular expressions:
|  [1] \[[^]]+\] \[.*\] \[client \] "GET .*
|
`- Number of matches:
   [1] 231 match(es)
[...]

It works! However when I restarted fail2ban, I encountered an issue whereby clients were almost instantly banned when making only a handful of requests, which leads me to…

How to badly configure fail2ban

This took some time to track down, but I had the feeling that for some reason my jail.conf was not correctly overriding maxretry – the number of times an event can occur before the jail action is applied, which by default is 3. I confirmed this by checking the fail2ban.log when restarting the service:

fail2ban.jail   : INFO   Creating new jail 'http-get-dos'
...
fail2ban.filter : INFO   Set maxRetry = 3

Turns out, the version of the http-get-conf jail I had copied from the internet into my jail.conf was an invalid configuration. fail2ban relies on the Python ConfigParser which does not support use of the # character for an in-line comment. Thus lines such as the following are ignored (and the default is applied instead):

maxretry = 600 # 600 attempts in
findtime = 30  # 30 seconds (or less)

Removing the offending comments (or switching them to correctly-styled inline comments with ‘;’) fixed the situation immediately. I must admit this had me stumped and seems pretty counter-intuitive especially as fail2ban doesn’t offer a warning or such on startup either. But indeed, it appears in the documentation, so RTFM, kids.

Note that my jail.local above has a jail for http-post-dos, too. The http-post-dos.conf is exactly the same as the GET counterpart, just the word GET is replaced with POST (who’d’ve thought). I’ve kept them separate as it means I can apply different rules (maxretry and findtime) to GET and POST requests. Note too, that even if I had been using http-get-dos today, this wouldn’t have saved me from denial of service, as the requests were POSTs!

Relay access denied when sending OSSEC notifications

As mentioned, OSSEC was capable of sending notifications but they were not delivered until it was far too late. I altered the global ossec.conf to set the email_to field to something more suitable, but when I tested a notification, it was not received. When I checked the ossec.log, I found the following error:

ossec-maild(1223): ERROR: Error Sending email to xxx.xxx.xxx.xxx (smtp server)

I fiddled some more and in my confounding, located some Relay access denied errors from postfix in the mail.log. Various searches told me to update my postfix main.cf with a key that is not used for my version of postfix. This was not particularly helpful advice, but I figured from the ossec-maild error above that OSSEC must be going out to the internet and back to reach my SMTP server and external entities must be authorised correctly to send mail in this way. To fix this, I just updated the smtp_server value in the global OSSEC configuration to localhost:


  
    yes
    [email protected]
    localhost
    [email protected]
  
...

Deny traffic to xmlrpc.php entirely

WordPress provides an auxiliary script, xmlrpc.php which allows external entities to contact your WordPress instance over the XML-RPC protocol. This is typically used for processing pingbacks (a feature of WordPress where one blog can notify another that one of its posts has been mentioned), via the XML-RPC pingback API, but the script also supports a WordPress API that can be used to create new posts and the like. For me, I don’t particularly care about pingback notifications and so can mitigate this attack in future entirely by denying access to the file in question in the apache VirtualHost in question:


...
    
        order allow,deny
        deny from all

tl;dr

Timeline

1557 (+0'00"): POSTs aimed at xmlrpc.php for two WordPress VirtualHost begin
1624 (+0'27"): mysqld terminated by OOM killer
1625 (+0'28"): OSSEC Level 12 Notification sent
1625 (+0'28"): mysqld respawns but attack persists
1635 (+0'38"): mysqld terminated by OOM killer
1636 (+0'39"): mysqld respawns
1700 (+1'03"): OSSEC Level 12 Notification sent
1703 (+1'06"): mysqld terminated by OOM killer
1713 (+1'16"): Disk IO 2-Hour anomaly notification sent from Linode
1713 (+1'16"): Linode notification X-Received and acknowledged by out of office sysop
1835 (+2'38"): Sysop login, netdata accessed
1837 (+2'40"): mysqld terminated by OOM killer, error during respawn
1839 (+2'42"): iptables updated to drop traffic from IPs, attack is halted briefly
1840 (+2'43"): Attack continues from new IP, iptables updated to drop traffic from block
1841 (+2'44"): Attack halted, load returns to normal, mysqld service restarted
1842 (+2'45"): All OSSEC notifications X-Received after poll from server

Attacker

POST requests originate from IPs in an assigned /24 block
whois record served by LACNIC (Latin America and Caribbean NIC)
Owner company appears to be an “Offshore VPS Provider”
Owner address and phone number based in Seychelles
Owner website served via CloudFlare
GeoIP database places attacker addresses in Chile (or Moscow)
traceroute shows the connection is located in Amsterdam (10ms away from vlan3557.bb1.ams2.nl.m24) – this is particularly amusing considering the whois owner is an “offshore VPS provider”, though it could easily be tunneled via Amsterdam

Suspected Purpose

Spam: Attacker potentially attempting to create false pingbacks (to link to their websites) or forge posts on the WordPress blogs in question
Scan hit-and-run: Scan yielded two xmlrpc.php endpoints that could be abused for automatic DOS

Impact

Intermittent apache stability for ~3 hours
Full service denial of mysql for ~2.25 hours
Intermittent disruption to email for ~2 hours

Failures

No monitoring or responsive control configured for high levels of requests to apache
OSSEC configured to deliver notifications to non-primary address causing messages that would have prompted action much sooner to not arrive within actionable timeframe
Failed to recognise (or consider) disk I/O anomaly message as a red herring for something more serious
Forgetting that the attack surface for WordPress is always bigger than you think

Positives

Recently installed netdata instance immediately helped narrow the cause down to apache based activity
Attack mitigated in less than five minutes once I actually got to my desk

Mitigations

OSSEC reconfigured to send notifications to an account that does not need to poll from POP3 intermittently
Added simple GET and POST jails to fail2ban configuration to try and mitigate such attacks automatically in future
Drop traffic to offending WordPress script to reduce attack surface
Develop a check list to be followed after receipt of an anomaly notification
Develop a healthy paranoia against people who are out to get you and be inside your computer (or make it fall over)
Moan about WordPress

Mitigation Tips and Gotcha’s

Set your OSSEC notification smtp_server to localhost to bypass relay access denied errors
Make use of fail2ban-regex to test your jails
NEVER use # for inline comments in fail2ban configurations, the entire line is ignored
If you are protecting yourself from GET attacks, have you forgotten POST?

How to switch SATA controller driver from RAID to AHCI on Windows 10 without a reinstall

Sam — Thu, 14 Jan 2016 03:50:57 +0000

This advice was posted in 2016 and worked for me. I’m aware that it has worked for others, but for several people (see comments), it wrecked their computer. Before following this advice, you should know what you are doing and be confident and prepared to restore your computer from a good backup. Good luck!

For those who don’t care for the learning adventure, just skip to my guide to switch storage controller driver without wrecking Windows.

This evening, I was bemused to find a Linux live disk unable to identify the storage volume on my new Dell XPS 13 laptop. A quick search introduced me to a problem I have not encountered before; the SSD was likely configured to use a SATA controller mode that did not have a driver in the kernel of the live disk installer. This is typically when the stock disk has been shipped in either IDE (for backwards compatibility purposes) or a vendor specific RAID mode, instead of the native Advanced Host Controller Interface (AHCI) that exposes some of SATAs more advanced features.

One can easily change this setting in the BIOS. On my XPS I had to navigate to System Configuration > SATA Configuration and switch the radio button selection from RAID On to AHCI. A rather scary warning informed me that this would more than likely break my existing partitions. As a curious scientist with a recovery partition as a safety net, I decided to proceed anyway. Unsurprisingly, Windows 10 failed to boot, electing to display the dreaded sideways smiley face and a suggestion that I read up about the INACCESSIBLE_BOOT_DEVICE error. Oops.

It turns out, to optimize boot times, Windows disables drivers that are deemed unnecessary for startup during installation. Herein lies the problem, if the OS is installed while the disk is in one of these other modes (in my case RAID), the driver that would allow us to speak AHCI to our speaking AHCI-speaking SATA storage controller is effectively disabled (even though it is installed). Windows, without the ability to communicate with the disk correctly, has no real option but to fall on its side with a glum expression and throw the INACCESSIBLE_BOOT_DEVICE error during startup. The accusations are corroborated by the Wikipedia article on the subject of AHCI:

Some operating systems, notably Windows Vista, Windows 7, Windows 8 and Windows 10 do not configure themselves to load the AHCI driver upon boot if the SATA-drive controller was not in AHCI mode at the time of installation. This can cause failure to boot, with an error message, if the SATA controller is later switched to AHCI mode.

So what are we to do? If I want to install and run Linux, I need my SSD’s SATA controller to be set to AHCI¹. Yet if I want to dual-boot with Windows, I need to use RAID to match the currently installed Intel vendor driver. A conundrum!

Official advice from vendors like Intel is that you should format the disk, set the controller mode as desired and then reinstall the Windows operating system. But this seems somewhat of a cop out, what if lazy people like me don’t have physical installation media to hand, or don’t want to go through the hassle of a format and reinstall? Evidently, I am not the first to ask this question; as there are many threads online that attempt to achieve this for Windows 10², with varying degrees of success garnered from fiddling around in the registry (and variants thereof) to merely booting into safe mode and back. Unfortunately, none of these fixes worked for me and so I worked to come up with my own:

Sam’s super easy guide to switching your SATA Controller from `RAID` to `AHCI` without destroying your Windows 10 disk

Boot to Windows with your current SATA controller configuration
Open Device Manager
Expand Storage Controllers and identify the Intel SATA RAID Controller
View properties of the identified controller
On the Driver tab, click the Update driver… button
Browse my computer…, Let me pick…
Uncheck Show compatible hardware
Select Microsoft as manufacturer
Select Microsoft Storage Spaces Controller as model³
Accept that Windows cannot confirm that this driver is compatible
Save changes, reboot to BIOS and change RAID SATA Controller to AHCI
Save changes and reboot normally, hopefully to Windows

If you’ve exhausted your luck elsewhere, I hope this works for you as it did for me, but your mileage will almost certainly vary.

1. Confusingly, according to the AHCI article on Wikipedia:
  
  Intel recommends choosing RAID mode on their motherboards (which also enables AHCI) rather than AHCI/SATA mode for maximum flexibility.
  
  If this really is the case, why doesn’t our trusty Linux live disk installer identify the dual-wielding AHCI and RAID disk in question? I wisely chose to stop at the entrance to the rabbit hole on this occasion and was just happy I could move on with my Linux installation.

1. Other articles exist for Windows 7 and Windows 8 too, but are generally disregarded as unhelpful for Windows 10. In particular, the default AHCI driver provided by Microsoft changed name between versions 7 and 8, so much of the advice pertains to registry keys and files that don’t exist if followed for versions 8 and 10.

A more specific driver from your vendor may be available for your storage controller.

`memblame`

Sam — Sun, 26 Apr 2015 10:06:36 +0000

As a curious and nosy individual who likes to know everything, I wrote a script dubbed memblame which is responsible for naming and shaming authors of “inefficient”¹ jobs at our cluster here in IBERS.

It takes time, often days, sometimes longer, of patience to see large-input jobs executed on a node on the compute cluster here. Typically this is down to the amount of RAM requested, only a handful of nodes are actually capable of scheduling jobs that have a RAM quota of 250GB or larger. But these nodes are often busy with other tasks too.

One dreary afternoon while waiting a particularly long time for an assembly to pop off the queue and begin, I started to wonder what the hold up was.

Our cluster is underpinned by Sun Grid Engine (SGE), a piece of software entrusted with scheduling and management of submitted jobs that over the past few months I have formed a strong opinion on². When a job completes (regardless of exit status), SGE stores associated job meta-data in plain-text in an “accounting” logfile on the cluster’s root node.

The file appeared trivially parseable³ and offered numerous fields for every job submitted to the node since its last boot⁴. Primed for procrastination with mischief and curiosity, I knocked up a Python-based parser and delivered memblame.

The script dumps out a table detailing each job with the following fields as columns:

Field	Description
jid	SGE Job ID
node	Hostname of Execution Node
name	Name of Job Script
user	Username of Author
gbmem_req	GB RAM Requested
gbmem_used	GB RAM Used
delta_gbmem	ΔGB RAM (Requested − Used)
pct_mem	%GB Requested RAM Utilised
time	Execution Duration
gigaram_hours	GB RAM Used × Execution Hours
wasted_gigaram_hours	GB RAM Unused × Execution Hours
exit	Exit Status (0 if success)

The table introduces the concept of wasted_gigaram_hours, defined as the number of RAM gigabytes unused (where RAM “used” is defined as equal to peak RAM usage as measured by the scheduler over the duration of the job⁵, unused therefore being the difference between RAM requested and utilised; delta_gbmem) multiplied by the number of hours the job ran for. Thus a job that over-requested 1GB of RAM and runs for a day, “wastes” 24 GB Hours!

I created this additional field in an attempt to more fairly compare different classes of job that take vastly different execution times to complete. i.e. Jobs that use (and over-request) large amounts of RAM but for a short time should not necessarily be shamed more than smaller jobs that over-request less RAM for a much longer period of time.

Incidentally, at the time of publishing the 1st Monthly MemBlame Leaderboard, no matter on the field used to order the rankings, a member of our team who shall remain nameless⁶ won the gold medal for wastage.

Though it wasn’t necessarily the top of the list that was interesting. Although naming and shaming those responsible for ridiculous RAM wastage (~0.76 TB Day^-1 over 11 days⁶) on an assembly job that didn’t even complete successfully⁶ is fun in jest, memblame revealed user behaviours such as a tendancy to request the default amount of RAM for small jobs such as BLASTing — up to ~5x more RAM than necessary — which easily tied up resources on smaller nodes when running many of these jobs in parallel. In the long run I’d like to use this sort of data to improve guess-timates on resource requests for large and long running jobs in an attempt to reduce resource hogging for significant periods of time when completing big assemblies and alignments.

I should add that “wasted RAM” is just one of the many dimensions we could look at when discussing job “efficiency”⁷. I chose to look at RAM underuse for this particular situation as in my opinion it appears to be the weakest resource in the setup that we have and the one with which users seem to struggle the most in estimating usage of.

If nothing else it promotes a healthy discussion about the efficiency of the tools that we are using and the opportunity to poke some light hearted fun at people who lock up 375GB of RAM over the course of two hours running a poorly parameterised sort⁸.

tl;dr

I wrote a script to name and shame people who asked for more RAM than they needed.

Although properly determining a metric to fairly represent efficiency is a task in itself. ↩
I’m also writing software with the sole purpose of abstracting away having to deal with SGE entirely. ↩
In fact the hardest part was digging around to locate a manual to actually decipher what each field represented and how to translate them to something human readable. ↩
Which seems to be correlated with the date of Aberystwyth’s last storm. ↩
It’s likely that jobs are even less “efficient” than as reported by memblame as scripts probably don’t uniformly utilise memory used over a job’s lifetime. Unfortunately max_vmem is the only metric for RAM utilisation that can be extracted from SGE’s accounting file. ↩
I’m sorry, Tom. ↩ ↩ ↩
Although properly determining a metric to fairly represent efficiency is a task in itself. ↩
That one was me. ↩

System Administration – Samposium

Using fail2ban to mitigate simple DOS attacks against apache (or why I am a terrible sysop)

Mitigation Implementation

Configure a crude fail2ban jail for apache DOS defence

How to badly configure fail2ban

Relay access denied when sending OSSEC notifications

Deny traffic to xmlrpc.php entirely

tl;dr

Timeline

Attacker

Suspected Purpose

Impact

Failures

Positives

Mitigations

Mitigation Tips and Gotcha’s

How to switch SATA controller driver from RAID to AHCI on Windows 10 without a reinstall

Sam’s super easy guide to switching your SATA Controller from RAID to AHCI without destroying your Windows 10 disk

`memblame`

tl;dr

Sam’s super easy guide to switching your SATA Controller from `RAID` to `AHCI` without destroying your Windows 10 disk