The adage goes, "In theory, theory and practice are the same. In practice, they're not." I would like to hear stories that prove this true. I have one.
In the late '80s I was working for a company that made stuff for government agencies, and we built a lot of that stuff from the PCBs up. We liked Ethernet, but it was still expensive to build, test, and integrate, and most facilities weren't ready to use it extensively, so most of our devices communicated over serial links using opto-isolated current-loop or differential interfaces for noise immunity. We were having a lot of trouble with equipment failing at a facility not far from Chicago. We assumed it was a software problem, because if you powered the equipment down and back up it would operate normally again. Failures were often noticed at shift change (these were 24-hour facilities with on-site maintenance and tracking of failures and recoveries, and we had a contractual obligation to meet a specific MTBF target).
A small team, me included, was sent to the facility. We brought a few tools, including a Tektronix 1240 logic analyzer, an in-circuit emulator for the microprocessor in the failing units, an HP 4952A protocol analyzer, and a DEC minicomputer to run the cross-development toolchain for the target systems. The facility provided lab space, with access to our units in the field as needed (we were issued facility IDs and given escorts as required).
We found a number of interesting problems that prove the old adage about theory and practice.
1. One of the developers on the project had removed the power-up ROM and RAM tests because "manufacturing test will catch them, and it will shorten the boot-up time". When we re-enabled these tests, we found several units in the field that failed them immediately. Some of the chips were not seated properly in their sockets (some had bent pins) and had not been caught in test. (There's a rough sketch of that kind of power-up test after this list.)
2. Some ribbon cables had been improperly routed and had been punctured by the pins from the back of one of the modules when the unit's lid was closed. It was assumed, but never proven, that this was shorting some signals in the cable and module. The cables had to be replaced and routed properly. How these units passed manufacturing test I'll never know.
3. Around midnight one night I discovered that I could build up a pretty good static charge by walking around the lab, despite the tile floor; the air was very dry. I touched the stainless steel case of one of the units in the lab and it failed immediately. I cycled power, it resumed operation, so I did it again, and it failed again. I couldn't make it fail with the in-circuit emulator connected, so I decided to risk the Tek 1240 (pretty expensive gear at the time) and set it up to trace the address and data buses on the unit's CPU. I discovered that one of the serial I/O chips was generating a DCD (Data Carrier Detect) loss interrupt even though the DCD- pin on the chip was grounded. The software was not decoding that interrupt correctly because "it couldn't happen" (see the second sketch after this list). I learned later that the interrupt was caused by ground bounce under certain conditions. I fixed the interrupt-decode software, burned new EPROMs, and by dawn I had upgraded the dozen or so units that had been failing most frequently. After we upgraded the rest of the units, the failure rate plummeted to well within our contracted MTBF.
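For anyone who hasn't written power-up self-tests, here's roughly the shape of the ROM checksum and RAM pattern test from item 1, as a minimal sketch in C. The memory map, sizes, and checksum scheme are invented for illustration; they're not the original firmware.

    #include <stdint.h>

    /* Hypothetical memory map -- not the original hardware. */
    #define ROM_BASE  ((volatile const uint8_t *)0x0000u)
    #define ROM_SIZE  0x4000u        /* pretend 16 KB EPROM */
    #define RAM_BASE  ((volatile uint8_t *)0x8000u)
    #define RAM_SIZE  0x2000u        /* pretend 8 KB static RAM */
    #define ROM_SUM   0x00u          /* ROM padded so its bytes sum to zero */

    /* Sum every ROM byte; an unseated or failing EPROM almost never sums right. */
    static int rom_test(void)
    {
        uint8_t sum = 0;
        for (uint32_t i = 0; i < ROM_SIZE; i++)
            sum += ROM_BASE[i];
        return sum == ROM_SUM;
    }

    /* Walk complementary patterns through RAM; a bent or open pin shows up
       as a bit that won't hold the value written to it. */
    static int ram_test(void)
    {
        static const uint8_t patterns[] = { 0x55u, 0xAAu };
        for (unsigned p = 0; p < sizeof patterns; p++) {
            for (uint32_t i = 0; i < RAM_SIZE; i++)
                RAM_BASE[i] = patterns[p];
            for (uint32_t i = 0; i < RAM_SIZE; i++)
                if (RAM_BASE[i] != patterns[p])
                    return 0;
        }
        return 1;
    }

    /* Called from the reset vector, before anything else trusts memory. */
    void power_up_self_test(void)
    {
        if (!rom_test() || !ram_test())
            for (;;)                 /* halt and let the fault be visible */
                ;
    }

Even a test this crude flags the unseated-chip and bent-pin failures we found, because a chip that isn't making contact almost never reads back what was written or checksums correctly.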
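And here's the shape of the interrupt-decode fix from item 3, again as a hedged sketch: the register address and bit names are made up, since I no longer have the part number, but the point is that every status bit gets handled, even the ones that "can't happen".

    #include <stdint.h>

    /* Hypothetical register address and bit assignments -- invented for
       the sketch, not the actual serial I/O chip. */
    #define SIO_STATUS    (*(volatile uint8_t *)0xA000u)
    #define SIO_RX_READY  0x01u
    #define SIO_TX_EMPTY  0x02u
    #define SIO_DCD_LOSS  0x04u   /* "can't happen" -- DCD- is tied to ground */

    static void handle_rx(void)     { /* read the receive data register */ }
    static void handle_tx(void)     { /* load the next byte to transmit */ }
    static void ack_dcd_loss(void)  { /* reset the chip's status latch  */ }

    void sio_irq_handler(void)
    {
        uint8_t status = SIO_STATUS;

        if (status & SIO_RX_READY)
            handle_rx();
        if (status & SIO_TX_EMPTY)
            handle_tx();

        /* This is the branch the original firmware lacked. Ground bounce
           could set the bit anyway, and with no handler the interrupt was
           never acknowledged, so the unit sat dead until someone cycled
           power. */
        if (status & SIO_DCD_LOSS)
            ack_dcd_loss();
    }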