Downtime is not an option
Engineers at Stratus Technologies test their fault-tolerant computers to make sure that customer applications never fail.
Martin Rowe, Senior Technical Editor -- Test & Measurement World, 2/1/2006
Maynard, MA—If the servers at banks or other financial institutions fail, they may lose your transaction—and your money. If a pharmaceutical company's servers fail, it may lose manufacturing documentation, and regulations may force it to destroy some of its production. If servers used in security systems or other public-safety applications fail, the consequences can be tragic. For these and other mission-critical applications, downtime can cost serious money, lead to legal problems, or compromise public safety.
Organizations that can't afford downtime often turn to Stratus Technologies. Since 1980, Stratus has supplied customers with hardened, redundant, fault-tolerant servers.
Stratus servers not only test themselves during startup and while in operation, but they also report impending problems to Stratus and switch to backup systems before downtime occurs. The servers employ two and sometimes three redundant CPUs to keep applications running even if one CPU should fail, without the user ever knowing that a failure occurred or that a CPU has been replaced. (The sidebar "Locked in step" describes how Stratus servers maintain redundancy.)
Stratus, however, doesn't rely solely on redundancy to keep its systems running. "Availability derived from redundancy alone is the exception, not the rule," said Joe Sanzio, director of systems engineering and quality assurance. Engineers design and thoroughly test CPUs and I/O subsystems to resist failure.
Always up
To run 99.999% of the time, Stratus servers need tests that go beyond those found in other computers. The company's engineers test new designs at every level. They inject hardware and software errors, run the systems at low voltages, and remove CPU modules and I/O subsystem modules, called "slices," from a rack while other slices keep running.
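To put the 99.999% figure in perspective, the short calculation below converts an availability percentage into a yearly downtime budget. It is only an illustrative sketch: the function name and the 365.25-day average year are assumptions made here, not Stratus figures.

```python
# Assumption: an average year of 365.25 days.
MINUTES_PER_YEAR = 365.25 * 24 * 60


def downtime_minutes_per_year(availability_percent: float) -> float:
    """Return the downtime budget, in minutes per year, for a given availability."""
    unavailable_fraction = 1.0 - availability_percent / 100.0
    return unavailable_fraction * MINUTES_PER_YEAR


if __name__ == "__main__":
    for pct in (99.9, 99.99, 99.999):
        print(f"{pct}% availability -> {downtime_minutes_per_year(pct):.2f} min/year")
```

Under these assumptions, "five nines" leaves a little over five minutes of downtime per year.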
[Photo: Steve Mango, manager of ftServer system development, leads a team of design engineers in hardware design and test.]
Stratus engineers work with the ICS PC Clock Division of Integrated Device Technology to design a custom clock-generation chip. As part of a clock circuit's design verification, Mango's staff measures phase noise, cycle-to-cycle jitter, and long-term jitter with a high-bandwidth oscilloscope. "We always need the highest bandwidth oscilloscopes on the market," said Mango. "That's why we rent them. If we purchased these scopes, we'd have to buy a new one for each new project." At the time of my visit, Mango's engineers were evaluating top-of-the-line scopes from Tektronix and LeCroy before deciding which one to rent.
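The jitter terms above can be made concrete with a small calculation on clock-edge timestamps. The sketch below uses commonly cited definitions (cycle-to-cycle jitter as the change in period between adjacent cycles, long-term jitter as the deviation of the time spanned by N cycles from its ideal value). The function names and the sample timestamps are hypothetical, and this is not necessarily how the oscilloscopes or Mango's staff compute these figures. Phase noise is a frequency-domain measurement and is not covered here.

```python
from statistics import mean


def periods(edges):
    """Convert rising-edge timestamps into successive clock periods."""
    return [t1 - t0 for t0, t1 in zip(edges, edges[1:])]


def cycle_to_cycle_jitter(edges):
    """Largest change in period between two adjacent cycles."""
    p = periods(edges)
    return max(abs(b - a) for a, b in zip(p, p[1:]))


def long_term_jitter(edges, n_cycles):
    """Peak deviation of the time spanned by n_cycles from its ideal value.

    The ideal span is taken as n_cycles times the mean measured period.
    """
    ideal = n_cycles * mean(periods(edges))
    spans = [edges[i + n_cycles] - edges[i] for i in range(len(edges) - n_cycles)]
    return max(abs(s - ideal) for s in spans)


if __name__ == "__main__":
    # Hypothetical edge times in nanoseconds for a nominal 10-ns clock.
    edges_ns = [0.0, 10.0, 20.1, 30.0, 40.0]
    print("cycle-to-cycle jitter (ns):", round(cycle_to_cycle_jitter(edges_ns), 3))
    print("long-term jitter over 2 cycles (ns):", round(long_term_jitter(edges_ns, 2), 3))
```

A real measurement would use many thousands of edges; the five-edge list only keeps the arithmetic easy to follow by hand.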
Engineers also model printed-circuit boards, connectors, sockets, and cables in both the time domain and the frequency domain and measure the effects these components have on signal integrity. They use a Tektronix communications signal analyzer with the time domain reflectometer (TDR) module to take single-ended and differential impedance measurements. They use software to operate the instrument as both a TDR and vector network analyzer.
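A TDR infers impedance from the fraction of a fast edge that reflects at a discontinuity. As a minimal sketch of that standard relationship, the code below converts a reflection coefficient into an impedance and back. The 50-ohm reference and the function names are assumptions for illustration; the instrument's own software handles calibration, differential measurements, and much more.

```python
def impedance_from_reflection(rho: float, z0: float = 50.0) -> float:
    """Impedance seen by the TDR step for reflection coefficient rho.

    Uses Z = Z0 * (1 + rho) / (1 - rho). rho = 1 (an open circuit) is
    rejected because the impedance would be infinite.
    """
    if not -1.0 <= rho < 1.0:
        raise ValueError("rho must satisfy -1 <= rho < 1")
    return z0 * (1.0 + rho) / (1.0 - rho)


def reflection_from_impedance(z: float, z0: float = 50.0) -> float:
    """Reflection coefficient for impedance z against reference z0."""
    return (z - z0) / (z + z0)


if __name__ == "__main__":
    for rho in (-0.1, 0.0, 0.2):
        print(f"rho = {rho:+.2f} -> {impedance_from_reflection(rho):.1f} ohms")
```

A positive reflection indicates a section with higher impedance than the reference, and a negative reflection indicates a lower one.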
The other half of Mango's engineering team designs and tests CPU motherboards and I/O subsystems. The I/O subsystems hold or connect to disk drives, keyboards, mice, network cards, and other peripherals. These systems must perform far better than most computers. For example, internal buses such as PCI Express must run at a bit-error rate (BER) of less than 10^-15, whereas most computers are deemed acceptable if their bus BER reaches 10^-12. To achieve such low error rates, engineers optimized Intel's reference design for PCI Express by moving signal lines and reducing crosstalk.
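The gap between the two BER targets is easier to appreciate as time between errors. The sketch below assumes a single lane running at 2.5 Gbits/s (the first-generation PCI Express signaling rate); that rate, the 95% confidence level, and the function names are assumptions made for illustration, since the article does not say how Stratus verifies the figure.

```python
import math


def seconds_between_errors(ber: float, bit_rate_bps: float) -> float:
    """Average time between bit errors at a given BER and bit rate."""
    return 1.0 / (ber * bit_rate_bps)


def bits_for_confidence(ber: float, confidence: float = 0.95) -> float:
    """Error-free bits needed to claim, at the given confidence level,
    that the true BER is below `ber` (zero errors observed)."""
    return -math.log(1.0 - confidence) / ber


if __name__ == "__main__":
    rate = 2.5e9  # assumed line rate, bits per second
    for ber in (1e-12, 1e-15):
        t = seconds_between_errors(ber, rate)
        test_time = bits_for_confidence(ber) / rate
        print(f"BER {ber:g}: one error every {t:,.0f} s on average; "
              f"about {test_time / 3600:,.1f} h error-free for 95% confidence")
```

At the assumed rate, a BER of 10^-12 corresponds to one error roughly every 400 seconds per lane, while 10^-15 corresponds to one error every 400,000 seconds, or about 4.6 days.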
[Photo: Hardware design lead Joe Amato uses logic analyzers connected to CPU motherboards so he can monitor bus signals.]
Amato also tests motherboards under varying voltage conditions. He lowers a board's operating voltage to levels below a processor's specifications to find where hard and soft failures occur.
Onboard diagnostics
In actual operation, a Stratus system's diagnostics will detect a low power-supply voltage before failure occurs, which will force an offline, backup CPU to take over. Mike Kement, design lead for power, mechanical, and compliance, designs circuits that trigger the switchover. Kement designed comparator circuits, based on a voltage reference, that monitor voltages and generate an interrupt that notifies the system of an impending failure. He designs the circuits with enough guardband to ensure switchover without a service interruption. Kement measures the accuracy of the voltage circuits with a Nicolet 12-bit oscilloscope.
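The guardband idea can be illustrated in software, even though Kement's monitors are analog comparator circuits. In the sketch below, the warning threshold sits a fixed margin above the minimum specified voltage, so the warning fires while the rail is still within specification. The class name, the 3.135-V minimum, the 50-mV guardband, and the sample readings are all hypothetical values chosen for the example.

```python
from dataclasses import dataclass


@dataclass
class RailMonitor:
    """Software model of a comparator with a guardbanded trip point."""
    v_min_spec: float  # lowest voltage at which the load is specified to work
    guardband: float   # margin added above v_min_spec

    @property
    def trip_threshold(self) -> float:
        return self.v_min_spec + self.guardband

    def should_warn(self, v_measured: float) -> bool:
        """True when the rail has dropped below the guardbanded threshold."""
        return v_measured < self.trip_threshold


def first_warning(monitor: RailMonitor, samples):
    """Index of the first sample that triggers a warning, or None."""
    for i, v in enumerate(samples):
        if monitor.should_warn(v):
            return i
    return None


if __name__ == "__main__":
    monitor = RailMonitor(v_min_spec=3.135, guardband=0.050)
    sagging_rail = [3.30, 3.25, 3.19, 3.17, 3.12]  # hypothetical readings, volts
    idx = first_warning(monitor, sagging_rail)
    print("warning at sample", idx, "with rail at", sagging_rail[idx], "V")
```

In this example the warning is raised at 3.17 V, one sample before the rail falls below the hypothetical 3.135-V minimum. A wider guardband gives more time to switch over, at the cost of more false warnings on a noisy rail.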
Kement also runs power-related tests on CPU and I/O slices. He measures power-supply ripple and noise throughout a board with an oscilloscope. He also tests systems to see how they respond to power-supply shorts and opens. In one test, he shorts a system's 150-A power supply with a 2-in. wide, 0.75-in. long piece of copper to see if it recovers after he removes the short.
"The shorting bar's resistance, 0.015 mΩ, must be low enough to force the supply into current-limiting mode," Kement said. "Otherwise, the short will function like a load and it will melt from heat." Kement reminisced about how he once tried this test with tweezers. Heat from the current fused the tweezers to the power supply. "We put provisions on our boards for test purposes," noted Kement. "Test points let us place shorts across power pins, such as those on disk drives. We make sure that shorts don't cause system crashes."
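Kement's point about heat follows from the I²R power dissipated in the short. The sketch below compares the quoted bar resistance (0.015 milliohm) with a purely hypothetical 10-milliohm short, assuming for simplicity that both carry the supply's rated 150 A. The real current-limit level and the resistance of the tweezers are not given in the article.

```python
def short_dissipation_watts(current_a: float, resistance_ohm: float) -> float:
    """Power dissipated in a resistive short: P = I^2 * R."""
    return current_a ** 2 * resistance_ohm


if __name__ == "__main__":
    current = 150.0                # assumed current through the short, amperes
    bar = 0.015e-3                 # shorting bar resistance from the article, ohms
    hypothetical_short = 10e-3     # illustrative higher-resistance short, ohms
    print(f"copper bar:         {short_dissipation_watts(current, bar):.4f} W")
    print(f"hypothetical short: {short_dissipation_watts(current, hypothetical_short):.1f} W")
```

Under these assumptions the copper bar dissipates about a third of a watt, while the higher-resistance short would dissipate 225 W, which is enough to explain rapid heating of a small metal object.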
Figure 1. A voltage-transient test tool provides access to a CPU’s signals.
As a thermal designer, Kement tests new products with as many as 40 thermocouples attached to a CPU or I/O slice. A Fluke data-acquisition system collects the data. During a test, Kement controls the power to the chassis' fans with a Xantrex programmable power supply while he measures air flow with an airflow sensor from DegreeC. "We run the thermal tests to make sure that the processors don't enter thermal-throttling mode," Kement stressed. Thermal throttling occurs when a processor senses that its temperature is too high, so it runs at a slower speed to control its temperature. Stratus computers are designed to run cool enough to avoid thermal throttling. Kement adjusts fan voltages to verify that there's enough thermal margin to avoid the condition.
All products also need temperature-cycling testing, which is also Kement's responsibility. He places products in a TestEquity thermal chamber and cycles them from 5°C to 50°C to meet network equipment building system (NEBS) requirements. Stratus servers are qualified to operate in a telecom central-office environment.
In his role as compliance engineer, Kement runs precompliance EMI scans in the lab on all new designs. He uses antennas from ARA Technology and a spectrum analyzer from Agilent Technologies to measure conducted and radiated emissions prior to sending a product to IQS or Curtis-Strauss EMC labs for compliance testing. He also performs power-line tests on all new products. A California Instruments AC power source lets him control mains voltage and inject faults such as dips and other transients. Finally, he conducts ESD tests with a Thermo KeyTek ESD simulator.
Software tests
Mango, Amato, and Kement come from the hardware side of Stratus, but that's just half of the test story. The rest falls to software and system testing. Software begins with the BIOS, the code that initializes and tests a motherboard and brings it to a state where the operating system can load. A Stratus BIOS is unique because it must initialize two motherboards and bring them into lockstep before the operating system takes over, whether that is Windows Server, Linux, or Stratus's own Virtual Operating System (VOS). In addition to initializing two or more motherboards, the BIOS performs many system tests.
Dan Lussier is director of firmware and advanced product development. He and others write and test the BIOS code. During boot up, the BIOS checks the motherboards of both the online CPU and the offline CPU. It will enumerate and test the I/O subsystems but make them available to the online motherboard only. If a motherboard failure is imminent, the BIOS will log the failure and activate the offline, backup CPU slice. The BIOS also constantly copies the contents of the online CPU's memory into the offline motherboard in case lockstep breaks and the offline motherboard must take over. Because these systems must always run customer applications, a motherboard can download and upgrade its BIOS without interrupting system operation.
Lussier and others test a BIOS by injecting faults while a motherboard boots. They simulate stuck bits by manipulating registers, and they evaluate how the BIOS responds to those faults. They also perform power failures during a boot to see how the BIOS handles a failed CPU slice.
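A stuck bit is a register bit that reads as 0 or 1 regardless of what is written to it. The sketch below models that fault in software and shows a simple write/read-back check that detects it. The class, its parameters, and the detection routine are hypothetical illustrations of the general technique, not Stratus's firmware or test scripts.

```python
class FaultyRegister:
    """Model of a register whose selected bits are stuck at 1 or 0."""

    def __init__(self, width=8, stuck_at_one=0, stuck_at_zero=0):
        self.mask = (1 << width) - 1
        self.stuck_at_one = stuck_at_one    # bit mask of bits forced high
        self.stuck_at_zero = stuck_at_zero  # bit mask of bits forced low
        self._value = 0

    def write(self, value):
        self._value = value & self.mask

    def read(self):
        # Apply the injected faults on every read.
        return ((self._value | self.stuck_at_one) & ~self.stuck_at_zero) & self.mask


def find_stuck_bits(reg):
    """Write all-zeros and all-ones; return (stuck_high_mask, stuck_low_mask)."""
    reg.write(0)
    stuck_high = reg.read()                 # any 1 here was never written
    reg.write(reg.mask)
    stuck_low = ~reg.read() & reg.mask      # any 0 here refused to set
    return stuck_high, stuck_low


if __name__ == "__main__":
    reg = FaultyRegister(width=8, stuck_at_one=0b00000100, stuck_at_zero=0b10000000)
    high, low = find_stuck_bits(reg)
    print(f"stuck high: {high:#010b}, stuck low: {low:#010b}")
```

The test of interest is then not whether the fault exists, since it was injected deliberately, but whether the code under test notices it and responds correctly.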
To inject bit errors into the system, Lussier uses a CPU motherboard's baseboard management controller (BMC), which is available through a "back door" command-line interface. Firmware engineers have written scripts that program the BMC to inject errors and report results. The BMC should report any and all hardware errors in a system and it should prevent a motherboard from coming online if it detects a hardware failure during a power-on self-test. "We still perform some BIOS tests by hand," admitted Lussier, explaining that engineers sometimes use switch boxes to inject bit errors. "But we're always looking for ways to automate our tests."
Automation is key to performing repeated system-level tests that can expose stability problems. John McQueeney, manager of platforms and options system test, leads a team of 10 QA engineers who also review designs. They look for ways to automate testing of a new product before they receive it. They test systems following integration of the operating system and any peripherals. McQueeney's team looks at system-level issues. "We test a system to see how the operating system responds to faults," said McQueeney. His engineers look for overall system stability following a fault.
McQueeney's engineers start by injecting single faults at the system level. Some are bit errors in cables while others come from simulated broken communications lines. For example, they use optical switches from Apcon to break optical communications links that carry Ethernet or Fibre Channel packets. If a system test point is suitable, they automate that condition to test it thousands of times. They also inject packet errors and simulate heavy amounts of traffic with either a PC or another ftServer.
[Photo: Joe Sanzio, director of systems engineering and quality assurance, oversees hardware, software, and system-level testing.]
"If a system repeatedly finds a bit error in the field," added McQueeney, "it's taken out of service, so engineers try to force those conditions in the lab to ensure that the system tracks mean time between failures." For example, if a system detects the same error once an hour for 10 consecutive hours, it will notify Stratus to send a replacement part. All motherboards contact Stratus directly, either by sending an e-mail or by dialing through a modem. A customer will receive a replacement slice the next day and technicians will install it without causing a service interruption.
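The "same error for 10 consecutive hours" rule can be sketched as a small counter that tracks, per error, how many hours in a row it has been seen. The class and method names below are hypothetical, and the real ftServer logic is not described in enough detail to claim this matches it; the sketch only illustrates the thresholding idea.

```python
from collections import defaultdict


class RepeatErrorTracker:
    """Flags an error once it has been seen in N consecutive hours."""

    def __init__(self, consecutive_hours=10):
        self.threshold = consecutive_hours
        self._last_hour = {}             # error id -> last hour index seen
        self._streak = defaultdict(int)  # error id -> current run length

    def record(self, error_id, hour):
        """Record an error in the given hour; return True at the threshold."""
        last = self._last_hour.get(error_id)
        if last is not None and hour == last:
            pass                              # repeat within the same hour
        elif last is not None and hour == last + 1:
            self._streak[error_id] += 1       # streak continues
        else:
            self._streak[error_id] = 1        # first sighting or gap: restart
        self._last_hour[error_id] = hour
        return self._streak[error_id] >= self.threshold


if __name__ == "__main__":
    tracker = RepeatErrorTracker()
    for hour in range(10):
        if tracker.record("memory-bit-error", hour):
            print(f"hour {hour}: threshold reached, request replacement part")
```

A lab test of this kind of rule would inject the same fault on schedule and confirm that the notification is sent at the expected hour and not before.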
McQueeney's team automates testing wherever possible. They've written scripts to generate network traffic and to initiate simulated faults. Until recently, they used VBScript, but now they use Perl. "We switched to Perl because it's a cross-platform language," said McQueeney. "It runs on Windows, Linux, and VOS. It can also call C-based routines. We wrote the scripts for Windows and we just need to recompile them for Linux and VOS."
The highest level of test responsibility falls to senior QA engineer Henry Ellis, who tests systems with software faults that simulate in-service errors. Using the company's VOS, Ellis runs systems to exhaustion. He focuses on VOS because the customers with the most critical needs use it.
Ellis pushes system resources beyond their designed limits by running more applications than he expects a customer to run. He also runs systems to their limits of memory and disk space to make sure they won't disrupt service. "I look for new ways to introduce trauma," he said proudly. "It's the only way we can fully understand how the customer uses our product."
Peripheral tests
Stratus engineers perform worst-case tests not only on products they design, but on purchased products, too. Stratus computers use industry-standard components wherever possible. Components include memory modules, hard drives, keyboards, mice, network cards, and other cards that use the PCI, PCI-X, and PCI Express buses. Unfortunately, these products are often out of Stratus's control, so they require constant testing.
Sanzio cited the testing of purchased components as one of the major challenges facing his team of 30 people. For example, he found a memory-module maker who changed to a different IC package. The change was irrelevant to many users, but not to Stratus, for the new parts failed when installed in a Stratus server.
Stratus engineers often uncover design flaws in purchased products. "We bring out the weaknesses because we often test products beyond their specifications," said Mango. "We once uncovered a timing error in a memory module that was sensitive to low power-supply voltages and shared our results with the supplier. After a year of working with the supplier, we dropped the company from our supplier list when the problem remained unresolved."
Firmware changes in purchased products can also affect their functionality in a fault-tolerant environment. When a maker of CD-ROM drives made a firmware change, Stratus systems failed a test that required an offline motherboard to become active. The CD-ROM failed to operate when the system switched to a backup CPU slice.
Crashes caused by poorly written device drivers are a significant cause of system restarts in most computer systems. Because system restarts are unacceptable, the company's software engineers write their own "hardened" drivers. Sanzio noted that Stratus had to modify Intel's drivers for its Gigabit Ethernet adapter cards and Fibre Channel host bus adapters to make them more crash resistant. The company's software engineers also write drivers for USB peripherals that ensure there's no chance that removing a USB device can cause a system crash.
Long-term challenges
Because customers use the company's products for longer periods than do users of general-purpose computers, new Stratus products must maintain complete compatibility with older systems. If a customer's system needs a replacement CPU slice, for example, the new slice must operate with that system's installed peripherals. Often, Stratus also must provide firmware that's not the latest version to maintain compatibility with older products.
Stratus engineers must also contend with issues that affect the entire electronics industry. For example, Sanzio sees compliance with the European Union's new directive restricting hazardous substances (RoHS) as a serious challenge facing the company. The changeover to lead-free components requires complete qualification tests on every aspect of a product. And as you can see, that's no small task when testing a product where downtime isn't an option.