DrPeering: Notification: The 111 8th Street Lesson

Hey DrPeering -
What Internet Exchange Point has the best redundancy and lowest downtime?
Sal Petersen
----------------

Sal -

All IXes are constructed with some degree of infrastructure and network redundancy.

DrPeering discussed IX redundancy needs with the ISP community and learned something interesting: they certainly wanted reliability, but most of them had also built their networks to withstand breakdowns in the peering fabrics. Many of them preferred a less expensive IX with less redundancy and chose to provide the redundancy themselves! What these folks wanted was not so much a robust, multi-tiered, redundant infrastructure as a lightweight, resilient IX with responsive Ops and, most importantly, notification when things did go awry.

The 111 8th Street building outage illustrates the three rules of IX outages:

  • things break;
  • everyone recognizes rule #1;
  • how the IX keeps IX participants informed is critically important to the participants.

 

This 111 8th Street story focuses on the extraordinary chain of events that led to an outage of more than 24 hours. I share it in the hope that we learn the lesson that notification and continual updates matter a lot to this community. George Santayana wrote in Reason in Common Sense (The Life of Reason, Vol. 1), "Those who cannot remember the past are condemned to repeat it." And in fact the Internet operations community has repeated the failure of notification again and again. Here are the broad strokes of the 111 8th Street notification story, shared with DrPeering anonymously ...

The 111 8th Street Outage

During the New York City blackout a few years back, the 111 8th Street building (which houses many major telecommunications companies' facilities) lost power. As designed, the UPS took over the building load and the generators on the roof kicked on.

On the roof of 111 8th Street are a handful of generators fueled by a relatively small (500 gallon) diesel tank, which is refilled by a powerful below-ground fuel pump attached to a couple of 50,000 gallon diesel tanks safely stowed under the building. The system was designed so that when the automatic transfer switch moved the building from city power over to the UPS and generator system, the fuel pump would also kick on to continually top off the rooftop diesel tank. And here is where the first failure occurred.

When installed, the fuel pump was tested, and sure enough, it turned on and seemed to start pumping fuel. The problem was that the polarity was reversed on the pump, so instead of pushing fuel from the underground tanks to the roof tank, it was accidentally rigged to pull fuel from the roof down to the underground tanks!  When the city power was cut, the underground fuel pump indeed powered on and made noise, but did exactly the opposite of what it was supposed to do.

During the blackout, all seemed fine at the Internet Exchange Point (which we will call X from here on) until the first generator on the roof stopped running, and then the second, and then the third. The tenants of the building who didn’t have their own generators and fuel were told there was a problem with the building power system. Our IX “X” notified their participants at this point that there was a power problem and that the building owner was checking it out.

Another generator kicked off. Only two generators were now left operating on the roof.

Internet Exchange Point Operator X sent a note to their customers saying that the building was down to its last generators and that customers should shut down all unnecessary gear in Operator X's facility at 111 8th Street.

So far, so good.

The building electricians figured out that the roof tank was empty and that the underground fuel pump was indeed running but pulling fuel down from the roof. They tried reversing the polarity, and the roof tank started filling up.

The roof tank filled up. But the generators were not starting up.  Why? 

During the hours that all of this was taking place, the generator starter motors were continually trying to restart the generators. The starter motors burned out!  It took many hours to get the replacement starters delivered through the partying streets of New York City.

Once the new starters were installed, the generators cranked but would not start. Why?

At the bottom of the 500 gallon roof tank was diesel sludge that had clogged the fuel pumps that fed the generators. It took many more hours to get the replacement fuel pumps delivered through the partying streets of New York City.

Once the new fuel pumps were installed (and the sludge cleaned out of the fuel tank and fuel lines), the generators started up.

This all took many hours to unfold. And there were zero notifications sent to customers after the IX lost power - radio silence for over 24 hours of downtime.

During the outage, the building folks were probably scrambling to diagnose and fix the problem and perhaps not updating the tenants that depended on the building for power. Perhaps the building folks updated Internet Exchange Point Operator X periodically. We do not know. But what we do know is that Internet Exchange Point Operator X did not keep its customers updated during the outage, so the customers did not know if their equipment would be powering back up in a few minutes, hours, or days, and could not prepare.  ISPs told DrPeering that they can understand that failures occur, but there were no updates for an outage that ultimately lasted over 24 hours.
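
As a minimal sketch of what "continual updates" could look like in practice, here is a small Python example. The 30-minute cadence, the function names, and the message wording are all illustrative assumptions; nothing in the story says what interval the ISPs expected or what tooling Operator X had.

```python
from datetime import datetime, timedelta
from typing import Optional

# Assumed cadence; the story does not say how often ISPs expect updates.
UPDATE_INTERVAL = timedelta(minutes=30)


def update_is_due(last_sent: datetime, now: datetime,
                  interval: timedelta = UPDATE_INTERVAL) -> bool:
    """Return True once a full interval has passed since the last notice."""
    return now - last_sent >= interval


def build_update(outage_start: datetime, now: datetime,
                 news: Optional[str] = None) -> str:
    """Compose a status notice, even when there is nothing new to report."""
    elapsed_s = int((now - outage_start).total_seconds())
    hours, remainder = divmod(elapsed_s, 3600)
    minutes = remainder // 60
    # Hypothetical default wording for a "nothing new" notice.
    status = news or "No change since the last update; work is continuing."
    return f"[{hours}h{minutes:02d}m into outage] {status}"
```

The design choice worth noticing is the fallback text: a "no change" notice still goes out on schedule, so participants can tell the difference between "nothing new yet" and radio silence.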

Lessons we should learn:

  • I anonymized Internet Exchange Point Operator X in this scenario because this could have happened to any data center operator where power is out of their control. Indeed, unanticipated failure scenarios happen to everyone in the data center space. What we should learn is that what seems to matter most to the participants is how the company responds to these events.
  • The part that was under Internet Exchange Point Operator X’s control was the notification and update, and in this scenario, they failed completely. The ISPs interviewed were very understanding about the outage itself, but were very upset about the notification failure.
  • The way the system failed demonstrates cascading failures. It was the power outage, followed by the reversed fuel pump polarity, that led to the starter motor burnout and the fuel pump clogging. If the fuel pump had not emptied the roof tank, the other two failures (roof tank sludge clogging the fuel pumps, starters burning out) might not have occurred, at least not then. It was the sequence of failures that made this an extended physical plant outage.
  • An N+1 electrical and mechanical system redundancy did not help.
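  • Testing, of course, was broken here: the installation test confirmed that the fuel pump turned on and made noise, not that fuel actually moved from the underground tanks up to the roof.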
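
One way to read the testing lesson is that a test should check the outcome (the roof tank level goes up) rather than the activity (the pump is running). Here is a minimal Python sketch of that idea. The tank-level readings, the 1-gallon tolerance, and the function name are hypothetical; the story does not describe what instrumentation the building had.

```python
def check_transfer_pump(pump_running: bool,
                        roof_level_before_gal: float,
                        roof_level_after_gal: float,
                        tolerance_gal: float = 1.0) -> str:
    """Classify a fuel transfer test by what happened to the roof tank level.

    All inputs are assumed readings taken before and after a short test run.
    """
    if not pump_running:
        return "pump did not start"
    change = roof_level_after_gal - roof_level_before_gal
    if change >= tolerance_gal:
        return "ok"          # fuel arrived at the roof tank
    if change <= -tolerance_gal:
        return "reversed"    # pump is draining the roof tank
    return "no flow"         # pump runs, but the level did not move
```

With readings like these, a pump that "turned on and seemed to start pumping" would still fail the test: `check_transfer_pump(True, 400.0, 380.0)` returns `"reversed"`, because the roof tank level dropped while the pump was running.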
