STATIC DISCLAIMER: All the stuff in here is purely my opinion, and my opinions tend to change depending on what mood I'm in. If you're going to get bitter if I say something about you that you don't like, then maybe don't read. I avoid using names as much as possible, and would request that people who know me do the same in their comments. Basically, I often vent my frustrations on here, so if you happen to be someone who frustrates me, expect to read a description of someone very much like you in here!

Wednesday, January 18, 2006

Work looks different at 3am

It all started at about 10am on Tuesday morning. I was sitting at my desk, which sits right next to the server cabinet, basking in the success of our recent domain rename operation. Meanwhile, my colleague was messing around at the rear of the stack of humming, very live, servers.
Suddenly, there was a beep.
Not just a short beep, but a continuous "something is very wrong" beep. For a second, I thought it must have just been one of the various desktops/laptops lying around, but I soon realised it was coming from the servers.
"What's that? Is that the file server?" I asked, noticing my colleague's slightly bewildered expression.
I quickly switched our server console over to the machine in question. Blue Screen of Death. Not good. It was then that I realised what had happened: my colleague had just plugged the SCSI cable from our new tape library into the RAID controller on our file server while it was all live. This was going to be a bad day...
My first reaction was the most obvious one: unplug the tape library, reboot the server. So that's what I did. During the boot-up process, I saw a line that read something like this:
Array #1:BIGDATA - Failed

Following that, the server tried to boot, but with no success. Why? Because there was no disk to boot off anymore. It was gone. I loaded up the array diagnostics, but it only told me what I already suspected. Our array of 10 drives is spread over 2 channels, 5 drives on each. When my colleague plugged in the tape library, 5 of those drives went offline simultaneously, and the array immediately became cactus. See, in a 10-disk array you can lose up to 3 disks at once and still have it continue to work, but if you lose more than that, the array fails and you have to start from scratch.
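If you want the failure spelled out in code, here's a toy sketch of the situation. The layout (5 drives per channel) is ours; the "tolerates 3 simultaneous failures" figure is just the one I quoted above, so treat it as an assumption about our particular array rather than a general RAID rule:

    # Toy model of the failure: 10 drives split across 2 SCSI channels,
    # with the array (so our controller claims) tolerating at most 3
    # simultaneous drive failures. Knock out one channel and 5 drives
    # go with it in one hit.
    DRIVES_PER_CHANNEL = {"channel_1": 5, "channel_2": 5}
    MAX_TOLERATED_FAILURES = 3

    def array_survives(drives_lost: int) -> bool:
        # The array stays online only while failures are within tolerance.
        return drives_lost <= MAX_TOLERATED_FAILURES

    print(array_survives(1))                                 # True: one dead drive, no drama
    print(array_survives(DRIVES_PER_CHANNEL["channel_1"]))   # False: whole channel gone = dead array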
Now, I didn't want to believe that this had happened at first, so I stuffed around with various utilities and progressed to firmware updates only to come back to what I knew to begin with: 5 drives fail = dead array. The array data was irreconcilably inconsistent, and the controller continued to insist that there was no possible way that sucka was coming back to life.
It's at this point that I discover that my boss is useful for something I'd never thought of before: damage control. I quickly set him up to play diplomat so that I can get on with doing the work we're actually employed to do. Meanwhile, I shoot my colleague a couple of pointed comments just to hammer home that this is his fault, and I'm not happy. Bad, I know, but I was cranky. I did apologise later...
So we move on to trying to rebuild. We have the technology, we can rebuild it. However, during the process of our recent domain rename, we'd moved every possible essential Active Directory function onto the server that was now sitting mute in front of me. Crap. This could be a problem. So I ring a friend of mine who is wise in the ways of AD, and after some puzzled musing over why on earth we don't have a backup of our AD database (I SO should have been in charge of backups), he gives me the following solution: make with the global catalogue, seize the FSMO roles, and pray for a replica.
So I jump onto one of our other DCs and do just that. Luckily for us, there was an up-to-date replica of our policy data on the server, so the GC was able to rebuild itself and for the most part our Active Directory was good from there on out. However, there was still the issue of the dead server...
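For my own future reference: the role seizure boils down to a single ntdsutil session. The server name below is made up, the subcommand spellings are from memory of the Server 2003-era tool (so check them before trusting me), and each "seize" wants interactive confirmation anyway, so this little Python script just prints the crib sheet rather than trying to automate it:

    # Prints the ntdsutil session used to seize the FSMO roles onto a
    # surviving DC. "DC02" is a placeholder name; in real life you type
    # these commands by hand at the ntdsutil prompt and confirm each seize.
    SURVIVING_DC = "DC02"

    crib_sheet = [
        "roles",
        "connections",
        f"connect to server {SURVIVING_DC}",
        "quit",                           # back out of the connections menu
        "seize schema master",
        "seize domain naming master",
        "seize infrastructure master",
        "seize PDC",
        "seize RID master",
        "quit",                           # back out of the roles menu
        "quit",                           # exit ntdsutil
    ]

    for command in crib_sheet:
        print(command)

After that it's a matter of making sure the surviving DC is also a global catalogue and letting replication settle down.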
So I start on a rebuild. Building a server doesn't normally take that long, but having to deal with a panicking colleague in the background can string it out a bit. Couple that with the fact that the server's driver CD decided it should downgrade the baseboard management controller firmware and cause all kinds of trouble, and it took me about 2 hours to get it up to being a fairly bare system with DNS and a DC role. From here, we moved on to the backups...
I don't know if you've ever done much with computers, but when you start talking about shifting 200GB files around, time seems to come pretty much to a standstill. See, our backup strategy has consisted of three 500GB FireWire hard disks that are cycled on a triweekly basis. So we've got this 200GB backup file sitting on this FireWire hard disk. But here's the catch: apparently, the disks have been failing of late. First I'd heard of it. So although there's a backup file there now, it may disappear if we start restore operations straight off the disk. Well, we can't have that happening, so we'd better make a copy before we do anything else. At about 4pm, we started the copy process. I'm not 100% sure why, but it finished at 9pm. 5 hours of waiting. During this time, I found out that Winamp has a selection of video feeds you can watch, and there's a feed that plays Futurama 24/7. Nice.
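Doing the maths on that copy afterwards made the five hours a little less mysterious. Rough numbers only, assuming the backup file really was a round 200GB:

    # Back-of-the-envelope numbers on the backup copy: 200GB in 5 hours.
    # Assumes a round 200GB (in binary gigabytes) and ignores any overhead.
    size_mb = 200 * 1024          # 200GB expressed in MB
    seconds = 5 * 3600            # 5 hours expressed in seconds

    throughput_mb_per_s = size_mb / seconds
    print(f"{throughput_mb_per_s:.1f} MB/s")   # roughly 11.4 MB/s

Call it a bit over 11MB/s, which would go some way towards explaining the wait.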
Anyway, after several coffees, a pizza and some Coke, the copy is finished and we start the restore process. People's home directories come back with full permissions and ownership intact. It's a beautiful, beautiful thing. However, the goodness was to be short-lived. The restore finished about 12:30am, leaving just one thing to do: rebuild Exchange.
Now at this point I'd officially decided that when I got Exchange restored, I was going to go home and not come to work again for at least 24 hours. However, the first part of that proposition turned out to be the problem. Exchange wouldn't restore. It just wouldn't. I tried over and over to get the mailbox stores to mount, but finally at 3am I gave up and headed home.
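For anyone playing along at home: a common first sanity check on an Exchange store that won't mount is to dump the database header with eseutil and look at the State line, which tells you whether the database thinks it was shut down cleanly. The paths below are placeholders for a typical Exchange 2003 install, so treat this as a rough sketch rather than a record of what I actually ran:

    import subprocess

    # Rough sketch only: dump the Exchange database header with eseutil
    # and pull out the "State:" line (Clean Shutdown vs Dirty Shutdown).
    # Both paths are placeholders for a typical Exchange 2003 install.
    ESEUTIL = r"C:\Program Files\Exchsrvr\bin\eseutil.exe"
    DATABASE = r"D:\Exchsrvr\mdbdata\priv1.edb"

    header = subprocess.run(
        [ESEUTIL, "/mh", DATABASE],
        capture_output=True,
        text=True,
    ).stdout

    for line in header.splitlines():
        if "State:" in line:
            print(line.strip())   # e.g. "State: Dirty Shutdown"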
Midday the next day and I'm back at my console, trying once again to make Exchange play nice with all the other friendly services. Overnight, Active Directory has kept itself replicated and playing nicely without throwing any errors at all, other than some bollocks about having duplicate SQL instances registered for the same server, but I'll deal with that later. Meanwhile, my well-meaning boss has organised to pay stupid quantities of money to have a senior tech from Commander come out and have a look. On the phone, this guy spoke to me like I was an idiot, so I wasn't looking forward to the experience. When he arrived, he continued with this level of superiority while he ran through everything I had just done. Three hours and at least a score of Google searches later, he told my boss that "...Justin has a fairly firm grasp of the issues involved, and maybe talking to Microsoft directly would be our best bet from here." Thanks very much, here's the bill, see you later.
By this point, I decide it's time for me to suggest a course of action. So I do. We'll hive off the old databases, rebuild the information store from scratch, give everyone new mailboxes, and get the system back up and running so we don't have another day of messages disappearing into the ether. We'll then look into extracting mailboxes from the offline database file to PST files (which must be fairly easy to do, judging by the number of companies offering the service) so that those PSTs can then be merged into the new information store. Brilliant. How come I'm the only one who thinks of these ideas? Oh, and just as a backup, I tell Stuart that tomorrow he should spend the day extracting the cached copies of mailboxes from the Executive's laptops so that we can get their emails back even quicker. Everyone nods and accepts the fact that I am t3h l33t. Exchange is my biatch, I tell you.
Meanwhile, the end result of this is that I'll probably leave St Paul's with people singing my praises while scowling menacingly at the other IT members. I like that. It works for me. The other result is that I'm crazy tired, and yet I'm up at 1am tonight just because my body thinks it's a good idea. w00t for coffee is all I can say.

Alright. Time for bed.

2 comments:

Nathan Zamprogno said...

What a saga! It's the stuff of nightmares, and thank God you had a backup. IT people should have a patron saint. Perhaps there already is one.

m said...

Wow. I didn't understand some of the technical speak in your post, so I can only imagine the pain you went through. Considering that today your work email was sending "cannot deliver" messages, I'm guessing things are still not well...