
"What Happens If We Flick This Swi...."



RHM
25th February 2007, 09:52 PM
Well that was fun, eh?


We were just informed one hour ago from our building management that an emergency maintenance window is scheduled for this Saturday morning, February 24th, at 1AM PST. UPDATE! The building has just changed the window to be 23 hours later.. it is now scheduled for Sunday morning, February 25th, at 12:01AM PST. This maintenance window involves taking our entire building’s power offline for approximately 3 hours. The word from the building:


It was discovered by ABM Engineering during the power monitoring
equipment installation, that a Phase C Conductor Cable on UPS #4
has been compromised to the point that immediate action to
repair it is necessary or a ground fault will occur to the building
systems.

Since only one of our three data centers is losing power, not all servers will be shut down. Unfortunately, our core routers and our upstream providers are in that data center, so there will be no network to any servers during the window. Gaah.

Since we have advance (albeit limited) warning of this event, we will be on hand to physically power off all of our equipment at 11:15PM PST. Barring any unforeseen issues (on our end or the building engineer’s) we plan on having everything back up by 4AM PST, hopefully much sooner.

This outage will affect everyone’s sites and email. Email sent to our customers during this outage will be deferred on the sending server and will be delivered after service is restored.

We apologize for the inconvenience this no doubt causes, as well as for the short notice. We will be posting updates as possible during the outage.

So that was supposed to be 8.00am until 11.00am GMT. I jumped online to keep an eye on things.


Power has been restored to the building and all servers are back up and running. There currently is an issue with a blade in one of the core routers and it is being replaced, so the servers are still unreachable until this issue is corrected. We are working to get service restored as soon as possible. Our apologies for the inconvenience.

- UPDATE 04:51 PST -

The supervisor in one of our core routers went kaput when the power went off. It was replaced shortly after the original posting. Further problems arose with corrupt configurations left over from the old supervisor. These have been restored and we have re-established connection to seven out of our eight uplinks. We are currently working on the eighth uplink and everything is coming back to normal operation now. We are close to being 100% and thank everyone for their patience. As always, check this space for updates.

- UPDATE 11:21 PST -

The router is up and healthy, and the majority of our network is back up and running. However, a few of our servers are having issues because their file servers are not talking to the rest of the network correctly. There is a link down from one part of the datacenter back to the routers and that is causing the file server issues. If you are on a server whose name is a beverage, this will affect you. We are working very hard to resolve this issue, and we apologize for the inconvenience.

This was complete bollocks, obviously, as we are on a server called "good" and it only came back up at 9.30pm GMT (13:30 PST). That's 13 hours, not 3...
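For anyone who wants to check the sums, here is a rough sketch of the time arithmetic. It assumes PST is a fixed UTC-8 offset (no daylight saving in February) and counts the outage from the scheduled 12:01AM PST start to the 13:30 PST recovery; the names `to_gmt` and `outage_hours` are made up for illustration.

```python
from datetime import datetime, timedelta, timezone

# Assumption: PST treated as a fixed UTC-8 offset (no DST in late February).
PST = timezone(timedelta(hours=-8), "PST")

# Times taken from the posts above, all on 25 February 2007.
window_start = datetime(2007, 2, 25, 0, 1, tzinfo=PST)   # scheduled power-off
back_online = datetime(2007, 2, 25, 13, 30, tzinfo=PST)  # our server reachable again


def to_gmt(moment):
    """Convert a timezone-aware datetime to GMT/UTC."""
    return moment.astimezone(timezone.utc)


def outage_hours(start, end):
    """Length of the outage in hours, as a float."""
    return (end - start).total_seconds() / 3600


print(to_gmt(window_start).strftime("%H:%M"), "GMT")  # start of the window
print(to_gmt(back_online).strftime("%H:%M"), "GMT")   # when service returned
print(round(outage_hours(window_start, back_online), 1), "hours")
```

Counting from the 11:15PM PST manual power-off instead would add roughly another 45 minutes.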

I've now pissed off everyone on the DreamHost forums by complaining and even got branded a spammer by the server status board at http://www.dreamhoststatus.com

Great way to spend a Sunday...

The irony is that it wasn't actually broken when they 'fixed' it...

Still, I suppose that's the problem with using an "employee-owned company" for our hosting: they're hardly going to sack themselves for incompetence. :wank: