Friday, 26 February 2010 15:46
Why are We Surrounded by Faulty Products Like Toyota Brake Pedals?
Copyright 2010 Kevin D. Stokes
"...as software complexity and size increase, it will eventually become prohibitively expensive and time consuming to reach the required levels of software quality via careful design, rigorous process, and painstaking testing..." - James Hamilton, Microsoft developer
Three Quick Answers to the Question That Are (Sometimes True but Generally) Incorrect
I'll revisit these incorrect answers later, but first let me start by admitting that I am a computer programmer. I do this for a living, and the software I write is running in embedded devices all over the world. You or someone you know has very likely made use of it (indirectly) for at least a fleeting moment. Thank goodness that when my software fails, nobody can possibly get hurt or even be significantly inconvenienced. Notice that I said 'when' and not 'if'. I am certain that my software fails on rare occasions, although I have no specific knowledge of any outstanding major bugs. I'll bet that when there is a failure, the operator simply shrugs, power cycles the device, and continues on.
Toyota: Software Bug in Prius Brakes
"Toyota officials described the problem as a "disconnect" in the vehicle's complex anti-lock brake system (ABS) that causes less than a one-second lag. With the delay, a vehicle going 60 mph will have traveled nearly another 90 feet before the brakes begin to take hold."
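The arithmetic in that quote checks out; here's a quick back-of-the-envelope verification (my own, not Toyota's):

```python
# At 60 mph, how far does a car travel during a one-second braking lag?
# 1 mile = 5280 feet, 1 hour = 3600 seconds.
FEET_PER_MILE = 5280
SECONDS_PER_HOUR = 3600

speed_mph = 60
lag_seconds = 1.0

feet_per_second = speed_mph * FEET_PER_MILE / SECONDS_PER_HOUR
distance_feet = feet_per_second * lag_seconds
print(f"{distance_feet:.0f} feet")  # 88 feet -- "nearly 90 feet", as quoted
```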
Now that I have admitted this, I ask the broader question:
What is happening to the world of consumer products?
Features and performance are ever increasing, but at the same time, reliability is decreasing. Take a minute to think about all the products you own which have little computers in them. How many of them have failed or acted strangely? Have you ever needed to power them off and on again, or take the batteries out for a minute, to get them to work again? No big deal, right? But hasn't anybody noticed that this problem is getting worse? That we have become so accustomed to random failure that we shrug it off, when instead we should be wondering where this trend is leading us? The crisis is already bad, and below I'll make the case that it is only going to get worse. Like other unsustainable trends, this one cannot continue, but what will eventually happen? In the next section, I will describe the types of failures I'm talking about: inexcusable failures in common name-brand items.
Six Examples of my Own Consumer Products that Have Really Annoying Faults
Here are some examples of recently purchased name-brand consumer products in my family which have inexplicable failures. There are many more, but I chose these because they represent a variety of product types and companies.
The Opiated Donkey Phone
I have an ordinary land line telephone (non-wireless) which every once in a while goes into molasses mode where the ring sounds like an opiated donkey, and it works but everything is just so slow. The solution: pull the plug and plug it back in. Then it's fine.
The Uncontrollable Muzhdhfoffpla Player
And my new Creative MP3 player would occasionally go into a weird mode where it totally garbled the music, and none of the buttons, including the volume and power buttons, could stop it. The only thing to do was to let it play the mangled song until it finished; then it was fine again. There was no rhyme or reason to the failures. Finally a firmware upgrade fixed it, so it was a software bug which somehow made it through their testing.
Built to be Cool but makes you look like a Fool
The anti-theft system in my Mustang on rare occasions identifies me as a car thief and makes the ignition refuse to start, and I have to wait 60 seconds before I can try again. That is pretty embarrassing when the reason you are trying to start the car is that you stalled out in an intersection.
The Contrary iPod
My daughter's new iPod Touch one day got itself into a mode where the volume couldn't be adjusted. The slider would move, but then jump back to the middle, and then the sound started getting louder when she turned it down, and softer when she turned it up. The solution? She had to turn it completely off and reboot the device. She reports this has happened twice in the last two months of daily use.
The Dishwasher of Broken Promises
Our dishwasher has a feature where you can tell it to start in 2, 4 or 6 hours. But it only mostly works. It sits there, with the little light on, promising to start at the directed moment, but once we go off to bed and the moment arrives, well, I don't know what happens, but frequently the dishes don't get washed. In the morning, the little light is out, but the dishwasher has done nothing.
The Confused Automatic Lights
My automatic lights outside the house sometimes get confused and won't automatically turn off or won't turn on. The solution: just turn the power off and leave it off overnight. Then it is fine.
Revisiting the Easy Answers
The following easy answers are what seem obvious to most people: if a product has a flaw, then the designers just did not try hard enough to get it right. But computer software is an entirely different animal from non-computer products. The experts in the field have concluded that there simply is no way to make perfectly reliable computer software, so they are adopting the 'Good Enough' strategy, because that is the only thing they can do if they insist on putting complex software in their products. To see why the three reasons below are in general not the reason behind faults in otherwise good-quality products, read the quote from Cem Kaner below, and if still in doubt, follow the links and read his more in-depth articles.
Cem Kaner is a Professor of Software Engineering at the Florida Institute of Technology and has written five well-received books on software reliability and testing.
"The quality/cost tradeoffs are difficult. I don’t think that it is possible today to release an affordable product that is error free and that fully meets the reasonable quality-related objectives of all of the stakeholders (including the customers). The goal of what is recently being called the "Good Enough Software" approach is to make sure that we make our tradeoffs consciously, that we look carefully at the real underlying requirements for the product and ask whether the product is good enough to meet them." - taken from The Impossibility of Complete Testing
Is it going to get Better or Worse?
Those were just six examples, chosen from a much larger set of such troubles that I have personally experienced. Already, I have enough digital computers around me that I never know when something is going to fail for an unknown reason. My concern is this: if standards for reliability have dropped this much in the last two decades, what will happen in the next two?
Bad Software Design In Medical Device Massively Overdoses 6 People
Between June 1985 and January 1987, a computer controlled radiation therapy machine called the Therac-25 massively overdosed six people.
"The operator activated the machine, but the Therac shut down after five seconds with an HTILT error message. The Therac-25's console display read NO DOSE and indicated a TREATMENT PAUSE. Since the machine did not suspend and the control display indicated no dose, the operator went ahead with a second attempt at treatment... Again the machine shut down in the same manner. The operator repeated this process four times in a row after the original attempt..." The patient had received a treatment of 13,000 to 17,000 rads, which is about 75 times the normal therapeutic dose.
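The Therac-25 failures were later traced to race conditions in the control software. As a toy illustration only (the class and names below are my own invention, not the actual Therac-25 code), here is the general check-then-act pattern that makes such bugs possible: a safety check passes, the state changes underneath it, and the machine acts on the stale result.

```python
# Toy sketch of a check-then-act race, loosely inspired by the Therac-25
# accidents. Names and structure are invented for illustration.

class BeamController:
    def __init__(self):
        self.mode = "xray"         # "xray" mode needs the tungsten target in place
        self.target_in_place = True

    def safety_check(self):
        # The check runs against the CURRENT settings...
        return self.mode != "xray" or self.target_in_place

    def fire(self):
        # ...but by the time we fire, the settings may have changed.
        if self.mode == "xray" and not self.target_in_place:
            return "MASSIVE OVERDOSE"
        return "normal dose"

beam = BeamController()
ok = beam.safety_check()           # passes: the target is in place
beam.target_in_place = False       # concurrent edit between check and fire
if ok:
    print(beam.fire())             # fires on the stale check -> "MASSIVE OVERDOSE"
```

The fix for this class of bug is to make the check and the action atomic (a lock, or re-checking at the moment of firing), which is exactly the kind of subtlety that slips past testing because the bad interleaving almost never occurs.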
Every single one of the examples above is most likely a bug in software. Minor annoyances, right? But what happens when these devices are operating the brakes on your car, or the control surfaces of the airliner you are on, or controlling the amount of radiation in the X-ray machine that is pointed at your face? Suddenly, random silly problems which only happen once in a blue moon are no longer so silly.
I believe that the trend towards unreliability will not turn around anytime soon, and I expect it to get much worse in the coming decades, because of the following assumptions:
See Why Software is So Bad (MIT Technology Review July 2002)
See Fault Avoidance vs. Fault Tolerance: Testing Doesn’t Scale (Microsoft 1999 Position Paper)
See Embedded Processing Trends, Part 3 (Dec 2009 Embedded.com article)
So the hardware manufacturers are busy advancing embedded systems with scads more memory, interrupts, DMA, and fancy peripherals. In the process, software complexity must grow, and as a result, faults will become even more common.
The embedded chips get bigger and faster and can handle more code. More code = more complexity. More complexity = more failure.
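A deliberately simplified model of why "more code = more complexity": every independent boolean setting doubles the number of distinct states a tester would have to cover, so exhaustive testing becomes impossible long before a program gets large.

```python
# Each independent boolean flag doubles the state space: n flags -> 2**n states.
# Even a modest device configuration is far beyond exhaustive testing.
for flags in (10, 20, 40, 80):
    states = 2 ** flags
    print(f"{flags:2d} flags -> {states:,} possible states")
```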
And so the failures will continue until.... What? Morale improves?
Software Bug in Boeing 777 Nearly Dooms Flight
As a Malaysia Airlines jetliner cruised from Perth, Australia, to Kuala Lumpur, Malaysia, one evening last August (2006), it suddenly took on a mind of its own and zoomed 3,000 feet upward. The captain disconnected the autopilot and pointed the Boeing 777's nose down to avoid stalling, but was jerked into a steep dive. He throttled back sharply on both engines, trying to slow the plane. Instead, the jet raced into another climb. The crew eventually regained control and manually flew their 177 passengers safely back to Australia.
A defective software program had provided incorrect data about the aircraft's speed and acceleration, confusing flight computers.
The Fundamental Flaw With Today's Technology
The fundamental flaw with most of our consumer products today is that they have computers in them, and the computer is an inherently mysterious and unreliable device. That statement may seem obviously false, since there are reliable computers and software which can run for a decade without a single failure. But what I'm really getting at is how the reliability is inversely related to the complexity of the system, and our embedded systems are rapidly becoming more complex, and therefore less reliable.
This relationship between complexity and unreliability is a fact of life for the digital computer running algorithmic software. (See the famous article No Silver Bullet by Fred Brooks.)
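One way to make that relationship concrete: if a system only works when all n of its parts work, and each part works with probability p, the whole system's reliability is p^n, which collapses quickly as n grows. (A simplified model with made-up numbers, assuming independent failures.)

```python
# Reliability of a chain of n independent parts, each 99.9% reliable.
# Even very good parts multiply into a very unreliable whole.
p = 0.999
for n in (10, 100, 1000, 10000):
    print(f"{n:5d} parts -> {p ** n:.4f} system reliability")
```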
Searching the web, one can find all kinds of old articles about attempts to fix this unpleasant fact about software. Smart people have worked on this for years, but nothing has changed. We are not on the brink of finding a new magical language, compiler or way of programming which avoids the reliability problem. The consensus seems to be that there is no way to make highly reliable software, so we just kind of try harder and test more.
Therefore, the reliance on digital computers in our devices is a reliance on software, and software is a fundamentally flawed foundation for reliable systems.
On the Other Hand, Smart Extremely Complex and Reliable Devices Clearly are Possible
Not all complex systems exhibit this extreme vulnerability to tiny variations or errors. We have billions of examples of devices which are more complex, with more capability, than any computer or network thereof, and these devices do not in general have reliability problems anywhere close to those of modern computer systems.
So what is the big deal, faulty products are nothing new!
Throughout history, our stuff has failed. Cars didn't start or had problems in 1950, flintlock pistols misfired or wouldn't fire at all, horse saddles wore out prematurely or came uncinched for no good reason because of faulty workmanship or bad design.
The difference is that today we treat these failures as normal and unavoidable. Or, to put it another way, we have accepted that it is too hard to make products which don't have obvious flaws. This is a new thing.
Nature's Computing Device
The brain is an example of a device which can make decisions, process input and take autonomous actions, and can outperform any computer or network of computers at the tasks it does best. Try watching the Olympic athletes compete and think about what kind of computing power and software would be required to take input from their eyes and other senses, and calculate the correct tensions and positions for their limbs to accomplish their tasks.
This device is so fault tolerant that it can lose millions of neurons and still operate with no detectable difference. There are confirmed medical cases of six-inch nails being driven into the middle of the device, and yet in some cases it can still operate just fine.
The normal average person's brain never needs to be rebooted. It doesn't need patches or firmware updates. It is an extremely stable system, and can handle incredibly noisy or incorrect input, and is the master at compensating for all kinds of problems. Nobody has to write software for the brain. Nobody has to try to test every possible input. Instead we 'teach' the brain using examples or practice.
The brain is an example of a complex system which can deal with noisy or partial data, and can automatically compensate for massive failures in the systems it uses. For example, if a dog loses a leg, the dog simply learns to walk without it. No programmer had to write lines of code for this. Blind people can use their senses to function despite a drastic reduction in their input.
"Freak" Software Glitch in Patriot Missile Leaves 28 Soldiers Dead
WASHINGTON, June 5, 1991. The computer failure that blinded a Patriot missile defense system to an Iraqi missile that killed 28 Americans during the Persian Gulf war was similar to a problem discovered in another Patriot battery in Israel five days earlier, Army officials said today.
Army investigators disclosed last month that a "freak" software glitch was to blame when the Scud missile hit an American barracks in Al Khobar near Dhahran in Saudi Arabia on Feb. 25, causing the war's single worst casualty toll for Americans.
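The "freak" glitch, as widely reported afterward, was accumulated rounding error: the system counted time in tenths of a second, but 0.1 has no exact binary representation, and the truncated 24-bit fixed-point value the software used drifted further from true time the longer the battery ran without a reboot. A sketch of that arithmetic (the register layout is a simplification, and the tick count and missile speed are approximations from public accounts):

```python
from fractions import Fraction

# 0.1 chopped to fit a 24-bit fixed-point register (23 fractional bits).
tenth_exact = Fraction(1, 10)
tenth_stored = Fraction(int(tenth_exact * 2 ** 23), 2 ** 23)
error_per_tick = tenth_exact - tenth_stored

# After roughly 100 hours of continuous operation, counting tenths of a second:
ticks = 100 * 3600 * 10
drift = float(error_per_tick * ticks)
print(f"clock drift after 100 hours: {drift:.3f} seconds")

# A Scud travels on the order of 1,676 m/s, so the radar's range gate
# ended up looking hundreds of meters from where the missile actually was.
print(f"tracking offset: {drift * 1676:.0f} meters")
```

A third of a second sounds harmless until it is multiplied by the speed of an incoming missile; this is the same "silly once-in-a-blue-moon problem" pattern, with lethal stakes.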
Now, the human brain is not perfect. There are plenty of things that can go wrong with a brain. I'm not suggesting that we abandon digital computers and work only on making artificial brains. What I'm saying is that we have at least one example of a highly capable and complex system which does not get less reliable as its complexity increases. The brain is an example of a system which is not dependent on any small part working perfectly. It is an example of a capable system which does not need every tiny step programmed into it. It is an example of a system which can automatically compensate for massive input and output failure, and sometimes even for massive internal failure.
If Computers Are Fundamentally Flawed and we can't Make Brains...
Then is there a way to avoid making faulty products? In the short term, no. Digital computers and software are firmly entrenched in the world, and to solve this crisis we need a paradigm shift. To see how this might happen, let's draw a parallel between the development of the automobile and the development of the computer.
In the early years of the first automobiles, there were a variety of power plant types. Steam cars came first but were limited in range and had long warm up times. Electric cars were popular in the early 20th century, but range and top speed were limited compared to the gasoline powered models. Eventually the gasoline powered vehicles won out, and for almost a century they have dominated the world.
But lately the price of fuel has risen steadily, and environmental concerns have suddenly made the gasoline-powered vehicle less attractive. It is clear from the great success of the hybrid car that people are ready for a return to electric-powered cars. Top speed is no longer an issue, and as range improves, the fate of the electric car will probably rest on cost per mile. As most of the world's easy-to-extract oil has been used up, the price of gasoline can only rise, making the return of the electric car seem almost inevitable.
Analog computers have been in use much longer than our modern digital devices. Early devices were mechanical in nature, and later on hydraulic and electrical machines were used. An example of a mass produced analog computer would be the Norden Bombsight used aboard American WWII bomber aircraft to calculate when to drop the payload in order to hit a target on the ground.
Neural networks were the subject of significant research in the 1950s and later. There was some early success with noise-tolerant pattern recognition, but interest waned.
The digital computer is unlike the analog computer or the neural network in that it operates in a completely deterministic way, executing instructions precisely. One early example is the British Colossus, an electronic computer used to break German secret codes during World War II; a historic rebuild of it exists today.
Man's Favorite Computing Device
Despite the domination of the gas-powered automobile for nearly a century, the popularity of hybrid cars shows that change is possible, despite the entrenched nature of a given technology.
The steadily declining reliability of our systems will hopefully begin to generate more interest in alternative ways of controlling devices. But no alternative technology comes even remotely close to the functionality we already have with the digital computer, so this is a bigger leap than the one from the gas-powered vehicle to the electric one.
Since we have so much invested in digital computers, perhaps the first steps towards a better system is to use existing hardware to implement a new technology in software, and eventually produce hardware once the technology is proven.
Software Errors Cost U.S. Economy $59.5 Billion Annually
Software bugs, or errors, are so prevalent and so detrimental that they cost the U.S. economy an estimated $59.5 billion annually, or about 0.6 percent of the gross domestic product, according to a newly released study commissioned by the Department of Commerce's National Institute of Standards and Technology (NIST).
Software is error-ridden in part because of its growing complexity. The size of software products is no longer measured in thousands of lines of code, but in millions. Software developers already spend approximately 80 percent of development costs on identifying and correcting defects, and yet few products of any type other than software are shipped with such high levels of errors. Other factors contributing to quality problems include marketing strategies, limited liability by software vendors, and decreasing returns on testing and debugging, according to the study. At the core of these issues is difficulty in defining and measuring software quality.
The Challenge For the Future
The assumption that reliability decreases as complexity increases for computer software means that we are currently on a path which leads to more and more failures. Already we are at a point where our common consumer products are failing constantly and for the moment we are simply accepting it. But there must be a certain point at which something will need to change.
Let's start thinking less about speed and features and more about reliability. How can we address this in the short term, and what steps should we take now to get headed in the right direction for the future?
Can we get academic researchers to realize that there is going to be a revolution in computing at some point, and that we need fresh ideas about how to make useful devices which may be something completely different?
It doesn't seem likely at this point, but having the news filled with stories of car brakes not working correctly may begin to wake people from their stupor.
Last Updated on Sunday, 28 February 2010 16:04