Looking for trouble
Gamers can explore NVIDIA-powered virtual worlds, but there's nothing virtual about the IC defects NVIDIA's failure-analysis team must locate.
Rick Nelson, Chief Editor -- Test & Measurement World, 5/1/2004
Santa Clara, CA—NVIDIA chips provide the graphics power behind dazzling PC games like Doom, and they help NASA scientists reconstruct photo-realistic 3-D views of Martian terrain from data transmitted by the Spirit and Opportunity rovers.
Many of the approximately 1400 people at the NVIDIA campus here could be said to inhabit virtual worlds similar to those of gamers and earthbound scientists navigating the Martian surface. NVIDIA is a fabless semiconductor company, so many employees spend their time in software-development or electronic-design-automation environments.
There's a key exception, though: The company comes face-to-face with the real world in its failure-analysis laboratory, where a team of seven tackles tasks such as finding the one bad transistor, out of hundreds of millions, that causes a single chip to fail.
Howard Marks works with an EmiScope system to locate failures in multimillion-transistor devices. He and his team can monitor the propagation of signals along internal paths, including 2.4-GHz signals traversing PCI Express lanes in NVIDIA's newest chips.
"We're a fabless company," says Howard Marks, manager of NVIDIA's silicon failure-analysis laboratory, "but we insist on understanding what is going on in the silicon that we are having somebody else build for us, whether it's TSMC, IBM, or UMC, or others, such as Chartered and SMIC, that we may employ in the future."
To do that, Marks has established a world-class failure-analysis laboratory, which includes the gamut of chemical and electrical equipment found throughout the semiconductor industry. He employs three electrical engineers, who make use of automatic test and inspection equipment to analyze failed parts, plus three deprocessing experts, who can disassemble packaged chips to permit detailed looks inside.
Semiconductor manufacturers use a variety of technologies to put chips together. Marks and his team adapt the same underlying technologies, in the form of decapsulation, grinding, plasma etching, delayering, and lapping, to take those chips apart again. For a close-up look, they can use their JEOL scanning electron microscope (SEM), which provides magnification to 300,000X and includes an energy-dispersive x-ray capability that identifies which elements are present, thereby helping to pinpoint contaminants. With the SEM, Marks says, "we can zoom right in on exactly where the failure is, see why it occurred, and feed that information back to fab."
Knowing where to look
Marks's lab sports an impressive array of equipment tailored to look for problems, but with 200 million transistors per chip, even a world-class operation needs some clues as to where to begin. It's not practical to use deprocessing technology and a SEM alone to try to find the one-in-a-million (or more) bad transistor. Much more finesse is required to learn what semiconductor layer to investigate and where to point the SEM, and it often begins with involvement of the original chip designers.
The lab remains a mystery to many such designers (not to mention the many employees who write drivers or test games). The designers, Marks says, are often in for a surprise should they have occasion to visit. "When problems occur in a device, we invite them in, and the first comment is often, 'oh, so that's what silicon looks like.' They've worked all year long designing this part, yet they have no concept of what the physical implementation will look like."
The likely occasion for a designer's visit is when a prototype back from the silicon foundry doesn't work. Other times, though, a defective chip, designed many years ago, might be returned from the field, and Marks has to track down the original designers. "Often, they've moved on to management or executive positions, and their reaction is, 'I designed that?' But we'll jog their memory, and they'll come in and help us out."
Designers can offer some idea of a device's failure mechanism based on its observed behavior. EDA companies like Synopsys, says Marks, have gotten better at helping designers with "reverse engineering," the process of extracting failure information from a device's responses to automatically generated test patterns. Reverse engineering can isolate a failure to a particular internal scan chain, and correlation with physical placement tools can pinpoint where the relevant circuitry sits in the device under test (DUT).
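As a rough illustration of the scan-based isolation described above, the following Python sketch compares expected and observed scan-out bits and ranks the scan cells that miscompare. It is not the flow that Marks's team or any EDA tool actually uses; the function name, the bit-string format, and the placement table of cell coordinates are all hypothetical.

```python
from collections import Counter


def locate_suspect_cells(expected, observed, placement):
    """Rank scan cells by how often their captured bit differs from the expected bit.

    expected, observed: lists of bit strings, one string per test pattern
                        (hypothetical format; character i belongs to scan cell i).
    placement: dict mapping scan-cell index -> (x, y) die coordinates
               (hypothetical stand-in for data from a physical placement tool).
    Returns a list of (cell_index, miscompare_count, coordinates), worst first.
    """
    miscompares = Counter()
    for exp_bits, obs_bits in zip(expected, observed):
        for cell_index, (e, o) in enumerate(zip(exp_bits, obs_bits)):
            if e != o:
                miscompares[cell_index] += 1
    return [
        (cell_index, count, placement.get(cell_index))
        for cell_index, count in miscompares.most_common()
    ]


if __name__ == "__main__":
    # Made-up responses for three patterns on a four-cell scan chain.
    expected = ["0101", "1100", "0011"]
    observed = ["0111", "1110", "1011"]
    placement = {0: (10.0, 5.0), 2: (120.5, 48.0)}
    for cell, count, xy in locate_suspect_cells(expected, observed, placement):
        print(f"scan cell {cell}: {count} miscompares, located at {xy}")
```

The cell that miscompares most often is the first place to look, and its coordinates suggest where to aim an instrument such as the SEM.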
Physically accessing such parts on an integrated circuit has always been difficult. Marks reports that with two-layer devices, he could use a laser to open a path to top metal, and then drop in a probe connected to a scope or other instrument to make electrical measurements.
With multiple layers, though, that approach became untenable. Marks reports that the laser would vaporize the target metal. So he turned to focused-ion-beam (FIB) approaches to deposit probe points, although with a FIB, the DUT needed to reside in a vacuum yet maintain connections to a source of electrical stimulation—an ATE system or, for in situ testing, a PC motherboard or graphics card.
As chip complexity increased, however, he and his team needed to reroute top-layer signals to open a path to deeper ones. Furthermore, the deposited probe points added trace capacitance that affected the signal being measured. Marks and his team continued to adapt by, for example, turning to active probes to ameliorate capacitive effects.
Nevertheless, the greater complexity resulted in the team spending a lot of time using the FIB and also going through a lot of devices to locate and analyze faults. To cut down on their labor, they turned to photon-emission technology, employing a Credence Optonics EmiScope to monitor the photons emitted when transistors switch.
Using an EmiScope, the team can monitor the propagation of a signal along an internal path—even 2.4-GHz signals traversing PCI Express lanes in NVIDIA's new chips. If at some point along a signal chain they find an unexpected, incorrect transistor output, they know they have isolated the fault and know where to point the SEM to find out why.
Marks and his team employed this technique on the company's first 0.13-micron chip from TSMC to locate a transistor gate that leaked current, inhibiting proper latch-up. NVIDIA and TSMC fixed that problem for the subsequent device, which went directly into the field without a stop at the failure-analysis lab, Marks reports.
Photon-emission-based failure analysis does have some drawbacks, Marks explains. For example, if you connect an electrical probe to an oscilloscope, you can see parameters like signal rise time; a rise time that is too long can indicate a defect such as a resistive via. Photon-emission systems don't permit observation of rise time directly—you see only the photon emitted when the target transistor switches. In addition, photon-emission systems cannot detect a voltage level.
Yet, Marks says that by measuring the timing of signal propagation along a signal chain, he can infer exactly where rise time might be a problem and then turn to other methods. "Ultimately, you always need a microscope to actually see what's going wrong." In general, he's satisfied with photon-emission technology and is even employing it for front-side analysis of wire-bonded chips. "You don't have to keep drilling new holes for new probe points as you advance along a signal chain."
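The timing inference Marks describes can be sketched in a few lines of Python. This is an illustrative assumption, not the EmiScope's software: the node names, the picosecond timestamps, and the 20-ps tolerance are invented for the example. The idea is to compare each stage's measured delay with the delay expected from simulation and flag any stage that runs late.

```python
def find_slow_stages(nodes, expected_ps, measured_ps, tolerance_ps=20):
    """Flag stages whose measured delay exceeds the expected delay by more than a tolerance.

    nodes: names of successive transistors or nets along one signal chain (hypothetical).
    expected_ps: expected switching time at each node, in picoseconds (e.g., from simulation).
    measured_ps: switching time observed at each node, in picoseconds.
    Returns a list of (from_node, to_node, excess_delay_ps) for each suspect stage.
    """
    suspects = []
    for i in range(1, len(nodes)):
        expected_delay = expected_ps[i] - expected_ps[i - 1]
        measured_delay = measured_ps[i] - measured_ps[i - 1]
        excess = measured_delay - expected_delay
        if excess > tolerance_ps:
            suspects.append((nodes[i - 1], nodes[i], excess))
    return suspects


if __name__ == "__main__":
    # Made-up numbers: the stage between n1 and n2 arrives 85 ps late.
    nodes = ["n0", "n1", "n2", "n3"]
    expected = [0, 100, 200, 300]
    measured = [0, 105, 290, 392]
    for src, dst, excess in find_slow_stages(nodes, expected, measured):
        print(f"stage {src} -> {dst} is {excess} ps slower than expected")
```

A stage flagged this way would be a candidate for a slow rise time, for example from a resistive via, and a target for the other methods Marks mentions.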
ATE considerations
Figure 1. An extender card positions a graphics-card-mounted DUT flip chip for analysis by an EmiScope photon-emission-based failure-analysis system. Courtesy of Howard Marks, NVIDIA; adapted from an ISTFA presentation.
As an alternative, he can employ an ATE system as the stimulus source, which can be advantageous for repeatedly exercising a DUT with vector sequences that elicit a failure. For ATE-based failure-analysis tasks, he solicits the help of the 17 ATE test engineers NVIDIA employs. "They're in and out of the lab. We grab them for failure analysis when we need them."
Currently, Marks uses the Agilent 93000 as the stimulus source. One reason, he says, is that the 93000 is a state-of-the-art system that can keep up with NVIDIA's parts; another is that "Agilent gave us a free tester," in the hopes that NVIDIA, having developed its test programs on the free platform, would encourage its foundries to order large quantities. "We have to put out 4 or 5 million a month of our new parts," he says. "All that generates a lot of test."
Asked if he expects to always dictate test-platform requirements to NVIDIA's contract manufacturers, he says, "No, we work with our suppliers and vendors. They don't dictate to us, and we don't dictate to them." In fact, he says, efforts are now under way to move some mature products (which he says continue to sell by the millions, though they don't generate the headlines of the new state-of-the-art parts) to less expensive test platforms.
Traditionally, he says, "New devices go faster and faster, so testers have had to get faster and faster. Now, though, we are looking into DFT testing, which, all of a sudden, lets the tester become slower because you're doing all internal testing. That helps our suppliers, although it's more work for us because we have to develop the DFT functionality. But we get some benefits out of it, too, because we are able to do more intense testing internally and get more test coverage. So you see, there is a lot of give and take" between him and his suppliers.
When asked if he fears that Credence will limit his test options by building hooks into the EmiScope that would make it work most efficiently with a Credence tester, he says, "They know not to do that. They know they have to work with other testers." He cites efforts such as STIL and OpenStar as evidence that tester manufacturers are getting the message that they need to move toward standardization and interoperability. Ultimately, he wants to choose a tester based on price and performance—not on whether he has access to a test engineer who knows how to program it.
When asked if he expects his suppliers to take on more failure-analysis chores, he says no. "Since there are so many different functional blocks inside our devices, there's no one person who knows everything that's going on. There's no one group even. Someone in Taiwan wouldn't know where to begin to look."
He does, however, see a place for Internet-based remote control of equipment like the EmiScope, which could be located overseas. Within the NVIDIA campus here, he's had some experience with that approach, allowing a designer to run the EmiScope from his desk.
Getting a start in failure analysis
Marks started at NVIDIA in 1998. Since then, he says, "All from scratch, I designed this lab, have bought every piece of equipment in it, and have hired the people."
Before coming to NVIDIA, he worked at Cirrus Logic, managing its failure-analysis lab, and before that, he worked at Amdahl, handling the various failure-analysis chores involved in bringing mainframe computers to market: "motors, frames, silicon, boards, whatever." Before Amdahl, he worked in process and product engineering at National Semiconductor.
"When I was a co-op in engineering school, I designed with little black boxes called ICs. I wanted to know what they were. We got Intel 3-in. diffusion furnaces that [Intel] was throwing away and installed them in our lab at school, and we were making our own devices. At Amdahl, I got a closer look inside. We were using Mostek DRAMs, and I began taking one apart. I took off the top layer, two layers, three layers, and so on, and I would look through a microscope and draw the circuits in that device. I did timing analysis and found out there were some timing errors. We sent the DRAM back to Mostek, and they fixed their chip. That was my start in failure analysis."
That experience has governed his failure-analysis philosophy since: "When you find a failed part, don't just say 'you're no good' and throw it out. When your assembly house makes a mistake, don't say 'you're no good' and drop it" from your list of qualified suppliers. "Feed the information back to your suppliers and get them to fix the problem. The industry as a whole will be better for your efforts, with next-generation parts having ever better reliability."
Partners In Test
Agilent Technologies, Santa Clara, CA; www.agilent.com
Credence Systems, Milpitas, CA; www.credence.com
FEI, Hillsboro, OR; www.feic.com
JEOL USA, Peabody, MA; www.jeol.com
For more information on failure analysis, visit www.tmworld.com/fa.