Scott,
Over the weekend of April 10-11, when I tried to log on to DCSki, a message appeared saying that your server was down and that the site would be inaccessible for a period of time. Yet less than 24 hours later, the site was up and running!
Did you apply a short-term band-aid or a longer-term fix?
Woody
Yeah, there was a bit of excitement this weekend!
Late Friday night, just as I was about to go to bed, the Network Operations Center at the facility that houses my server called to let me know my server had just stopped responding. They sent a tech to the cage and reported the server's lights were off, and that it wouldn't boot.
I was up most of Friday night trying to diagnose the problem and implement a contingency plan, which involved setting up an alternate site (which provided the outage message you saw). The outage message was being delivered within 11 or 12 hours of the initial failure; it can take that long to re-propagate a domain name to a new location.
On Saturday, I drove down to Virginia where the server lives, and brought a spare parts kit I had (thankfully) purchased six years ago when I bought the server (an Apple XServe). We pulled the server off the rack, and I swapped out its PMU battery and power supply. Hit the power button, and it roared back to life -- phew! Apple has made it very easy to swap out parts in the XServe -- it took less than a minute to pop out the power supply and put in a new one. (Newer versions of the XServe come with an option to have two live power supplies, so it switches over to a backup instantly if the primary fails -- I think I'll buy that option in the future if/when I upgrade the server.)
So after running non-stop for over 50,000 hours, the power supply had decided it was time to retire. I was thankful I had the spare, and that it was still working -- it had been stored in a very hot attic all this time. (I was crawling around my attic at 2 a.m. searching for it!) I hope to find another spare I can buy to have on hand in case this happens again, since power supplies are the most likely thing to fail.
It wasn't a fun weekend; it was pretty stressful, because depending on the nature of the failure, there was the possibility of losing data, going weeks without primary service, and incurring a large expense to get things back up and running. (As it is, I'm expecting a large bill from the colocation facility for the emergency weekend/after-hours assistance, and I also had to purchase service for the alternate/backup site.) I've always worried about a hardware failure, because obtaining access to the server quickly is challenging; it's located in the equivalent of a digital Fort Knox two hours away in Virginia. Hopefully it's good for another 6 years, though! And it was nice to be able to have live support from the companies I contract with throughout the night and weekend. Even though the failure occurred late on a Friday, the server was 100% back with no data loss within 17 hours.