…and Why I Haven’t Updated in a Few Days
An astute reader might recall the so-called “capacitor plague” from the earlier part of this decade. The general consensus holds that the plague of failing capacitors originated from corporate espionage and the theft of an electrolyte formula minus a critical component. Without the critical component–a stabilizer–charge and discharge cycles, combined with their respective heating and cooling, would eventually generate a buildup of hydrogen gas, triggering a potentially catastrophic failure of the capacitor.
I recall reading about that in 2005, because that was when the influx of failing boards that had integrated these capacitors in previous years began to hit computer repair shops. I was working for TCI during my fall semester of that year, next door to MDC Computers, and I recall that for several months, they were tending to nearly a machine a week suffering from “bad caps.”
When I left to finish my studies, I thought that the faulty capacitor problem was destined to become a distant memory. In December 2006, I built my current workstation; it was reasonably inexpensive, and I’ve always had an interest in building and integrating the components myself, but I never realized that a fairly critical component would fail about three years and three months later for precisely the same reason that had kept the guys at MDC insanely busy for months.
It was Thursday evening, and I was working on a TurboGears project. As is typical of those evenings when I feel adventurous enough to explore unfamiliar frameworks, I had a few dozen windows open, some music playing, several SSH sessions, and the like. I recall that I was exploring some of TurboGears’ internals and was in the process of setting up a template to test a new idea.
Then my monitor shut itself off.
Over the years, I’ve experienced a few unusual hardware failures, including faulty video cards, but save for a catastrophic event tied to CPU, motherboard, memory, or power supply failures, I’ve seldom experienced any that would freeze the entire operating system. Everything had frozen; my speakers were no longer playing in the background (I was listening to something in the trance genre), and the machine would not respond to pings from the file server. It was dead.
Of course, when you’re working on a system that mysteriously freezes, your first inclination is to send a few curses about software problems flying. I had booted into Windows that day, and you might imagine that several of my foul words had been directed toward Redmond, Washington! Whatever it was, it was probably a driver fault, and I suspected at the time–ironically, in retrospect–that it was tied to the video card. NVIDIA’s drivers are reasonably stable, though their quality has degraded over time, and I’ve seldom encountered any extraordinary circumstances under which they’ve failed. My thoughts, then, shifted to my aging Sound Blaster Live!; two months prior, it had begun to exhibit a foul temper when loading games and often generated a blaring, throbbing tone in protest. Creative Labs certainly would never be a company one might accuse of writing outstanding drivers! (In their defense, they have gotten better over the years, but I’ve seen more than my fair share of “IRQL_NOT_LESS_OR_EQUAL” BSoDs implicating Creative’s sblive! driver.)
Nevertheless, a simple software problem can always be solved by popping the reset button, booting back to the operating system at fault, and examining logs for hints that might lead to the apprehension of the bug in question. Once Windows finished starting, I entered my password and waited. As the background appeared and the taskbar was loading, the black screen mysteriously appeared and the system hung.
I sat there for a moment perplexed that the same event which was originally responsible for forcing me into troubleshooting mode had returned to haunt me a second time. Worse, there was no evidence of a driver-related BSoD or other kernel-level OS fault (Windows started normally without so much as a prompt to enter safe mode). I knew what to do: Boot to Gentoo, examine the message log for potential clues that might indict a growing hardware problem, and then take a closer look at the Windows partitions for further evidence of possible kernel dumps.
I made the decision then to boot Gentoo and waited. A few errors popped up during init that indicated my Windows drive couldn’t be mounted, and neither dbus nor hald was able to start. I wasn’t terribly surprised by the latter, but the former event had me puzzled. dbus and hald were dependent upon a library I had rebuilt some days earlier, and I suspected then that they were linked to the earlier version; certainly nothing a revdep-rebuild wouldn’t fix. But the issue with my Windows drive being unmountable bothered me. Had the drive actually died? There wasn’t any indication of an immediate hardware problem in the device, but it most assuredly could not be ruled out.
As I entered my password and waited momentarily for the X session to finally launch, I mulled over the likelihood of a hard disk failure. That had to be it! No matter, I thought, I’d reimage the machine tomorrow and make my determination of a potential storage issue. Yet again, my contemplation was interrupted by a black screen–and a total system freeze.
Double-U Tee Eff, question mark.
I tapped my fingers on the keyboard. This was unusual, very unusual. Two hard drives failing? Possible, yes, but unlikely. Perhaps the SATA controller on the motherboard had gone on the fritz or was in its death throes. Clearly, something was angry. Very, very angry. I had one more operating system to try.
Another tap of the reset button, a few thumps of my thumb against the wrist rest, and Ubuntu was loading. Ahhh, KDE 4! How I had forgotten about you, I thought. Surely this would work. I tapped away my password, hit enter, and waited…
Black screen. Again.
Whatever was happening was far from humorous now. I reached over for a live CD but withdrew my hand only moments before touching the case. That wasn’t going to work if the SATA controller had indeed failed–my DVD drive is also plugged into that same controller! I’d have to dig up an old PATA drive later if I wanted to even consider the possibility of booting to a repair disk. BIOS might have additional clues.
One more press of the reset button revealed a more puzzling ending: nothing. No boot. No beeps. No errors. The system was dead.
For the rest of that night, I tried a variety of tests and made dozens of attempts to resuscitate the machine. Nothing worked. Not until the next morning, after I had tested the system’s power supply and happened to glance over at the video card sitting on my workbench, were the culprits revealed.
The board in question was an EVGA 7600 GS (512MiB RAM), and it failed approximately three months after the end of its three-year warranty period. Of course, it isn’t like I’d actually want to replace it; I was due for an upgrade!
The interesting bit about this story is that about four or five months ago, I was startled by the sound of a loud “pop” coming from the direction of my workstation. I assumed it was the violently explosive reaction of a wayward bug that had managed to discover a short, yet nothing immediately indicated a problem. I even disassembled my workstation and examined the motherboard for capacitor problems, but I never considered that the video card would be at fault. As it turns out, capacitor failures on this board are unsurprising. It’s just interesting how long the card managed to limp along without any overt indication of its plight. (When I questioned my father–an electrical engineer–about this, he explained that capacitors which fail with a burst along their top vent still have some capacitance that diminishes over time and exposure to heat. Eventually, they will fail completely, but unless the failure is close to the contact point, the capacitor can still work.)
It was an interesting find, to say the least!