Huge outage angers Mimecast customers

Datacentre services down for three hours yesterday, with vendor still fielding user problems today

Mimecast has apologised to angry UK customers after a three-hour outage yesterday left them unable to use its email services and created a backlog the vendor is still working through.

A network hardware failure shut down services from one of its UK datacentres, causing an outage between 11am and 2pm yesterday, but customers and partners are still contacting the firm as they try to get back up and running.

Mimecast tweeted updates to customers through the night to try to ease the problems. The response was praised by some but did not allay others' concerns.

IT consultant Si Macintosh told the vendor via Twitter that the issue had been a "total disaster" as "valuable time and money [had been] lost".

Another Twitter user affected by the outage told the vendor: "Many of us sign off on Mimecast as part of a business continuity plan. How does that make us look?"

In a blog update posted at 10:45am this morning, Mimecast said its infrastructure and technical teams were restoring more services over time, but that some residual problems might remain, particularly with its services for Outlook.

As part of its efforts to fix the problems, customers hosted on certain areas of the datacentre, its Service 63 and Service 64 cluster pair, will suffer a further loss of service this afternoon because of another hardware failure, the firm said on its blog, adding that there was "no alternative" for those affected.

Very sorry

Soon after the problem started yesterday, Mimecast's chief executive issued a grovelling apology, in which he explained why the outage happened.

He said: "For three hours today we did not live up to our availability promise. We are very sorry.

"Over the past 10 years we have not had any significant outages because of our infrastructure and because of the constant scenario planning we conduct to ensure we are militating against any points of failure.

"As a cloud vendor, our platform infrastructure works in an active-active model, where communications are handled by all sides of our grid. If there is any unavailability in a component, another part of the grid can take over. Failing over an entire datacentre happens extremely rarely and we deliberately do it manually as an automatic failover of this scale brings significant risks.

"The plans we had in place underestimated the time it would take to complete the task. We aim for under 30 minutes, however this one took us over two hours.

"We will be reviewing this procedure and making sure that we can do it faster – much faster – should we be called upon to do it again."