A week of Murphy's Law gone wild.
Chapter One. The Linux server hosting our CVS repository (all our source code) fails. No big deal; it is automatically mirrored (using rdist) to a remote location. It takes a few hours to compress and transmit the mirrored data. Then we discover that we forgot to pass rdist the option that removes deleted files, so the mirror isn't perfect: it still includes files that were deleted from the original, and these have to be removed by hand.
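For the record, here's a minimal sketch of the kind of Distfile we should have had. The host and paths are hypothetical placeholders; the crucial part is the remove option (spelled -oremove in rdist version 6), which deletes files on the mirror that no longer exist at the source:

```
# Hypothetical Distfile: mirror the CVS repository offsite.
HOSTS = ( backup.example.com )
FILES = ( /home/cvsroot )

${FILES} -> ${HOSTS}
	# "remove" deletes mirror files that no longer exist at
	# the source -- the option we forgot.
	install -oremove ;
```

Run it from cron with rdist -f Distfile and the mirror stays an exact copy, deletions included.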
When this is all done I decide to check out the whole source tree from scratch and compare it to what I already have, as a final sanity check. But I don't have enough disk space on my laptop to do this. Time to upgrade. I order a 60 GB laptop hard drive and a PCMCIA hard-drive connector that is supposed to let you clone the old hard drive onto the new one. This process takes something like 6 hours and fails when it is 50% complete, instructing me to "run scandisk." Which takes a couple of hours. Start another copy. 6 more hours. At 50%, it fails again. Only now, the original hard drive is toast, taking my entire life with it. It takes a couple of hours of fiddling around, putting the drive into different computers, etc., to discover that it is indeed lost.
OK, not too big a deal: we have daily backups (NetBackup Pro). I put the new 60 GB drive into the laptop, format it, and install Windows XP Pro. I instruct NetBackup Pro to restore that machine to its pre-crash state. I'll lose a day of work, but it was a day in which I hardly got anything done, anyway. A day of email was lost, so if you sent me something this week and I didn't respond, resend.
NetBackup Pro works for a few hours. I go home to let it finish overnight. In the morning the system is completely toast and won't even boot. I hypothesize that it must be because I tried to restore a Win2K image on top of an XP Pro OS. So I start again, this time installing Win2K (format hard drive: 1 hour; install Win2K: 1 hour; then install the NetBackup Pro client). And I start the restore again. Five hours later it's only halfway done, and I go home.
The next morning the system doesn't quite boot; it blue-screens. But a half hour of fiddling around with Safe Mode and I get it to boot happily. And behold, everything is restored except, for some reason, the few files I had let Windows encrypt for me (using EFS), which are inaccessible. This has something to do with public keys and certificates: EFS decrypts files with a private key stored in your user profile, and that key died with the old drive, so when you restore the encrypted files onto a fresh install you can't read them. I still haven't found the solution to this. If you know how to fix this I will be forever indebted to you. [1/26: I fixed this problem after a few hours of tearing out my hair.]
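In case it saves someone else the hair: the preventive step is to export the EFS certificate and private key while the machine is still healthy, so you can import them after a restore. A sketch, assuming a version of Windows where cipher.exe has the export switch (XP SP1 and later, if memory serves; the backup filename is hypothetical):

```
REM Back up the EFS certificate and private key; prompts for a
REM password and writes efskeys.pfx. Windows XP SP1 or later.
cipher /x efskeys

REM On Windows 2000 the equivalent is exporting the certificate
REM (with its private key) from the Certificates MMC snap-in.
```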
Lesson Learned: This is not the first time that a hard drive failure has led to a series of other problems that wound up wasting days and days of work; in fact, I believe it's the third. Notice that I had a very respectable backup strategy: everything was backed up daily, offsite. Conclusion: backups aren't good enough. I want RAID mirroring from now on. When a drive dies I want to spend 15 minutes putting in a new drive and then resume working exactly where I left off. New policy: all non-laptops at Fog Creek will have RAID mirroring.
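To make the policy concrete, here's a sketch of what a software mirror looks like on a Linux box using mdadm; the device names are hypothetical, and a hardware RAID controller accomplishes the same thing:

```
# Build a two-disk RAID-1 mirror from two hypothetical partitions.
mdadm --create /dev/md0 --level=1 --raid-devices=2 /dev/sda1 /dev/sdb1

# Check the mirror's health and rebuild progress.
cat /proc/mdstat

# When a drive dies: fail it, remove it, drop in the replacement.
mdadm /dev/md0 --fail /dev/sda1 --remove /dev/sda1
mdadm /dev/md0 --add /dev/sdc1
```

The array keeps serving reads and writes the whole time, which is exactly the 15-minutes-and-back-to-work experience I'm after.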
Chapter Two. Did you notice that our web server was down? On Friday around noon a fire in a local Verizon switch knocked out all our phone lines and our Internet connectivity. Verizon got the phone lines working in a couple of hours, but the T1 was a bit more problematic. We purchased the T1 from Savvis, which in turn hired MCI (now called WorldCom) to run the local loop, and of course WorldCom doesn't actually run any loops, God forbid they should get their hands dirty; they just buy the local loop from Verizon.
So from Friday at noon until Saturday at midnight, Michael and I, working as a tag team, call Savvis every hour or so to see what's going on. We're pushing on Savvis, who occasionally push on WorldCom, who have decided that some kind of SQL Server DDoS attack can be blamed for everything, so they kind of ignore Savvis, who don't tell us that WorldCom is ignoring them, so we push on Savvis again, and they push on WorldCom again, and around the third time WorldCom agrees to call Verizon, who send out a tech who fixes the thing. Honestly, it's like pushing on a string. Just like the last time Savvis made our T1 go down for a day, the technical problem was relatively trivial and could have been diagnosed and fixed in minutes if we weren't dealing with so many idiot companies.
Lesson Learned: When you're buying a service from a company that's just outsourcing that service, one level deep, it's difficult to get decent customer service. When there are two levels of outsourcing, it's nearly impossible. Much as I hate to encourage monopolistic local telcos, the only thing worse than dealing with a local telco directly is dealing with another idiot bureaucratic company that itself has no choice but to deal with the local telco. Our next office space will be wired with Verizon DSL, thank you very much.
Incidentally, none of you would have noticed this outage at all if Dell had delivered our damn server on time. We were supposed to be up and running in a nice Peer1 highly redundant secure colocation facility a month ago. See previous rant. Did I mention that I have a fever? I always get sick when things are going wrong.
Chapter Three. For the thousandth time, the heat on the fourth floor of the Fog Creek brownstone is out. Heat is supplied by hot water pipes running through the walls, and those pipes are now frozen solid. How did they get a chance to freeze? Oh, that's because the furnace went off last week, because it was installed by an idiot moron, probably unlicensed, who put in a 25-foot-long horizontal chimney segment that prevents ventilation and has, so far, hospitalized one tenant and caused the furnace to switch off dozens of times. Finally someone at the heating company admitted that it was possible to install a draft inducer to force the chimney to ventilate, which they did, but not before the hot water pipes had frozen. Of course, the pipes are inadequately insulated, thanks to another incompetent in the New York City construction trade, but that wouldn't have mattered if the furnace had kept running.
Lesson Learned: Weak systems may appear perfectly healthy until neighboring systems break down. People with allergies and back problems may go for months without suffering from either one, until suddenly an attack of hay fever makes them sneeze hard enough to throw out their back. You see this in systems administration all the time. Use these opportunities to fix all the problems at once. Get RAID on all your PCs and do backups; don't use EFS; always buy hard drives that are way too large, so you never have to stop to upgrade them; and double-check the command-line options to rdist. Install the draft inducer and insulate the pipes. Move your important servers to a secure colo facility and switch the office T1 to Verizon.