ZDNet UK



Anatomy of a server-room meltdown

Matt Loney ZDNet.co.uk

Published: 02 Aug 2004 16:35 BST


The following story is a cautionary tale for anyone who runs a server room.

Back in June, the UK experienced its first hot weekend of the year. One IT manager, who asked to remain anonymous in return for sharing the litany of horrors that followed that weekend (we'll call him Bob), spent Saturday and Sunday, like most people, enjoying the sunshine. Like most IT managers, Bob carries a phone, to which his monitoring systems send text messages should anything go wrong in the server room. On this particular weekend, like most others, there were no text messages warning of any problems, and Bob spent a relaxing couple of days in the sun, safe in the knowledge that the servers back at work were humming quietly away.

Bob's weekend was only slightly spoilt on Sunday evening, when he tried to log on to his corporate email account but couldn't connect for some reason. Never mind, he thought, a switch must have failed. It will just need a quick reboot in the morning.

How wrong he was.

"I turned up to work on Monday morning," says Bob, "to find the whole comms room had gone down. When I opened the door the temperature was about 45 degrees (Celsius)."

When the temperature in a comms room reaches that level, there is only one explanation: the aircon has failed. "We had two units, which we thought provided redundant air conditioning," says Bob. "But when one seized, the second one was unable to cope with the load and so that one shut down too."
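Bob's point about redundancy comes down to simple arithmetic: two units are only redundant if either one can carry the whole heat load alone. The sketch below illustrates that check. The function name and the kilowatt figures are hypothetical, chosen for illustration; the article does not give the ratings of Bob's units or the heat output of his room.

```python
def survives_single_failure(unit_capacities_kw, heat_load_kw):
    """Return True if cooling still covers the heat load after losing any one unit.

    All figures are hypothetical; real sizing would also need safety margins.
    """
    if len(unit_capacities_kw) < 2:
        # A single unit has no backup at all.
        return False
    # Worst case: the largest unit is the one that fails.
    remaining_kw = sum(unit_capacities_kw) - max(unit_capacities_kw)
    return remaining_kw >= heat_load_kw


# Two 10 kW units cooling a 15 kW room: fine together, not redundant.
print(survives_single_failure([10.0, 10.0], 15.0))
# The same units cooling an 8 kW room: either unit can cope alone.
print(survives_single_failure([10.0, 10.0], 8.0))
```

In the first case the pair copes comfortably while both are running, which is exactly why the shortfall can go unnoticed until one unit seizes.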

As if that wasn't bad enough, in the building where Bob's company is located, the main air conditioning is shut down at weekends to save money. Even in the winter, the offices can be pretty warm first thing on a Monday morning; in the summer they're stifling. So just imagine what it's like in a nicely insulated room with several dozen email, Web and application servers churning out many hundreds of watts. As Bob put it, "The trouble with comms rooms is that when you switch the aircon off, they stop being a cool room and turn into an oven."

Obviously, one of Bob's first jobs on Monday was to bring the temperature back down. The other, less obvious job (to anyone who has never had an aircon unit fail) was to start mopping up. "When the aircon switched off," says Bob, "moisture condensed in the pipes that lead to the units on the roof." As this moisture condensed, there was only one place for it to go: down the pipes, through the vents and onto the server-room floor. As for the temperature, says Bob: "On the Monday morning, we restarted the one working air conditioner and that began to have an effect. Then we looked for the cause of the equipment shutting down - it turned out that the UPS had reached its critical temperature and powered down to protect itself."

There were actually two UPSes - one main one and a second, smaller one, for the monitoring system. The smaller of the two should survive at least 20 minutes after any power failure to send out text messages to support staff. This did not happen. The smaller UPS did not have a thermal shut-down - instead, it just fried.
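The catch in this setup is that a monitoring system which only speaks up when something breaks is indistinguishable from one that has died: silence looks the same either way. One common mitigation, which the article does not say Bob's team used, is a heartbeat check run from somewhere outside the comms room that raises an alarm when the monitor itself goes quiet. A minimal sketch follows, with hypothetical names and a hypothetical 15-minute threshold:

```python
def monitor_is_silent_too_long(last_heartbeat_s, now_s, max_silence_s=900):
    """Return True if the monitor has not checked in within the allowed window.

    Times are in seconds (for example, Unix timestamps). The 900-second
    default is an illustrative assumption, not a recommended value.
    """
    return (now_s - last_heartbeat_s) > max_silence_s


# Last heartbeat 5 minutes ago: nothing to report.
print(monitor_is_silent_too_long(last_heartbeat_s=0, now_s=300))
# Last heartbeat 2 hours ago: the monitor itself may be down.
print(monitor_is_silent_too_long(last_heartbeat_s=0, now_s=7200))
```

The check only helps if it runs on hardware that does not share the comms room's power and cooling; otherwise it fails along with everything else.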

By the end of the day the single aircon unit had brought the temperature back down, and the IT team thought they were OK and could survive on the single unit for the short term. After all, all they needed to do was call the aircon engineer and everything would be hunky-dory.

But life, as most of us know, is rarely that simple.

Read on to find out what went wrong next.
