Post your IT redundancy tales here

David Gerard · 5 months ago

Post your IT redundancy tales here

@[email protected] · 5 months ago

At a previous job a colleague and I used to take on most of the physical data center work. Many of the onprem customers were moving to public cloud, so a good portion of the work was removing hardware during decommissioning.

We wanted to optimize out use of the rack space, so the senior people decided we would move one of our storage clusters to the adjacent rack during a service break night. The box was built for redundancy, with dual PSUs and network ports, so we considered doing the move with the device live, with at least one half of the device connected at all times. We settled on a more conventional approach and one of the senior specialists live migrated the data on another appliance before the move.

Down in the DC the other senior showed us what to move and where and we started to carefully unplug the first box. He came to check on us just after we had taken out the first box.

Now I knew what a storage cluster appliance looked like, having carried our old one out of the DC not too long ago. You have your storage controller, with the CPU and OS and networking bits on it, possibly a bunch of disk slots too, and then you had a number of disk shelves connected to that. This one was quite a bit smaller, but that’s just hardware advancement for you. From four shelves of LFF SAS drives to some SSDs. Also the capacity requirements were trending downwards what with customers moving to pubcloud.

So we put the storage controller to its new home and started to remove the disk shelf from under it. There was a 2U gap between the controller and the shelf, so we decided to ask if that was on purpose and if we should leave a gap in the new rack as well.

“What disk shelf?”

Turns out the new storage appliance was even smaller than I had thought. Just one 2U box, which contained two entire independent storage controllers, not just redundant power and network. The thing we removed was not a part of the cluster we were moving, it was the second cluster, which was currently also handling the duties of the appliance we were actually supposed to move. Or would have, if we hadn’t just unplugged it and taken it out.

We re-racked the box in a hurry and then spent the rest of the very long night rebooting hundreds of VMs that had gone read only. Called in another specialist, told the on-duty admin to ignore the exploding alarm feed and keep the customers informed, and so on. Next day we had a very serious talk with the senior guy and my boss. I wrote a postmortem in excruciating detail. Another specialist awarded me a Netflix Chaos Monkey sticker.

The funny thing is that there was quite reasonable redundancy in place and so many opportunities to avert the incident, but Murphy’s law struck hard:

We had decomm’d the old cluster not a long ago, reinforcing my expectation of a bigger system.
The original plan of moving the system live would have left both appliances reachable at all times. Even if we made a mistake, it would have only broken one cluster’s worth of stuff.
Unlike most of the hardware in the DC, the storage appliances were unlabeled.
The senior guy went back to his desk right before we started to unwittingly unplug the other system
The other guy I was working with was a bit unsure about removing the second box, but thought I knew better and trusted that.