The beep of a server failure notification is a sysadmin’s most hated sound. What’s worse, though, is not getting the alert at all. I won’t go into detail about why (new server, old phone, basically), but the SMS never reached me. Server problems always happen at the worst time, and this one struck just before I went to bed, just after I’d checked my email for the last time. When I got up this morning I was greeted by an inbox full of notifications from humans and monitoring software. The whole site was returning Not Found errors.
It didn’t take too long to find the problem.
xfs_force_shutdown(sdf,0x2) called from line 1043 of file /build/buildd/linux-ec2-2.6.31/fs/xfs/xfs_log.c. Return address = 0xc02ea4e3
Filesystem "sdf": xfs_log_force: error 5 returned.
This means that the filesystem that carries the site had died: error 5 is a plain I/O error, and XFS had shut itself down. The site is hosted on an Amazon EC2 instance, with the site and database on a separate Amazon EBS volume. The latter appears to have let out the magic smoke. Luckily it just took a quick reboot to bring it all back to life and shift it to new hardware. No data seems to have been lost.
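For anyone who wants to catch this kind of failure straight from the kernel log, here is a minimal sketch in Python. It is not what my monitoring actually runs: the function name find_xfs_failures and the regular expression are my own illustration, built only from the two log lines quoted above, so they will not cover every XFS error message.

```python
import re

# Pattern is an assumption derived from the two kernel messages quoted above;
# it is not an exhaustive list of XFS failure messages.
XFS_FAILURE = re.compile(
    r'xfs_force_shutdown\((?P<dev>[^,)]+),'
    r'|Filesystem "(?P<fs>[^"]+)": xfs_log_force: error \d+ returned'
)


def find_xfs_failures(log_lines):
    """Return the sorted device names mentioned in XFS failure messages."""
    devices = set()
    for line in log_lines:
        match = XFS_FAILURE.search(line)
        if match:
            # Exactly one of the two named groups is set for any given match.
            devices.add(match.group("dev") or match.group("fs"))
    return sorted(devices)


sample = [
    "xfs_force_shutdown(sdf,0x2) called from line 1043 of file "
    "/build/buildd/linux-ec2-2.6.31/fs/xfs/xfs_log.c. "
    "Return address = 0xc02ea4e3",
    'Filesystem "sdf": xfs_log_force: error 5 returned.',
]

failed = find_xfs_failures(sample)
if failed:
    print("XFS shut down on: " + ", ".join(failed))
```

In practice you would feed it the lines from dmesg or the kernel log file instead of the sample list, and hook the result into whatever sends your alerts.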
All I can do is apologise to everyone who has been inconvenienced by this, and assure you that I’ve updated the monitoring procedures to ensure this doesn’t happen again. This includes sending the messages to the phone I use just for alerts, a Samsung B2100 that won’t let me down. It’s basically indestructible and the battery lasts nearly a month. Take a look!