Boeing 737 Max. When software takes too much control

There’s an unfolding story right now, of how software killed nearly 400 souls over 5 months.

Its lessons speak to why we take a careful approach to TextBlade software.

—————————————————

TextBlade is all-new technology. It makes keyboards far smarter, less work, and more powerful than ever before.

A big foundation that makes this possible is extensive software engineering.

There is perhaps 100X more software inside TextBlade than in any other real keyboard ever made.

The way we architected that software defines the user experience.

Software produces advances in ease of use, reduction of stress, and many new conveniences.

Done well, you never want to go back to legacy keyboards. It feels magical.

But if it had been done too casually, it would have risked frustrating users. We all hate autocorrect software when it comes up with some absurd overwrite of a very reasonable entry. Which is why TextBlade has no autocorrect agents at all.

Software run amok is much, much worse than no software at all. Which is why we validate so extensively with real, everyday use by our customers.

And it is also why we architect our software with lots of checks and balances, self-monitoring, and auto-recovery to make it resilient, even to quirks in your computer that are totally outside of TextBlade’s job.

Your life is not at risk from a keyboard. It does not hold your body in mid air, 6 miles up.

But you rely on a keyboard every day to do what you do, more than you realize. So it’s wise to think through what could possibly go wrong, and carefully design that out, through the fundamental architecture itself.

For the Boeing jet, however, not doing that right actually killed people.

——————————————————

What went wrong?

The new jet has bigger, more efficient engines, placed farther forward, which make it more prone to stall. They compensated by adding software to automatically resist stalls.

They fast-tracked approval by keeping pilot training unchanged from their popular prior models, to get a leg up on competing with Airbus more quickly.

But the software took control away from the pilot. And it made crucial decisions relying on only a single sensor.

It violated a basic rule of robustly designed systems - you never want to be vulnerable to a single point of failure, especially in life-safety applications.

So when the sensor gave a false reading, that software trusted the sensor more than the pilot.

It put the jet into a hard dive which the pilots could not overcome, forcing the plane into a high-speed nose dive straight into the ground, blowing a large crater and killing everyone instantly.

This was not some weird fluke event. It was a fundamental, systemic error in design.

There was organizational failure at multiple levels in Boeing. The engineers at Boeing are plenty smart enough to understand the design rule of redundancy, and manual override for any system failure. But commercial pressure on managers got them to overrule the engineers, to push the product out faster.

Boeing is not a small company, and we have all flown in their trusted, excellent jets for half a century.

But size won’t save you. Hubris is how even the mighty fall.

When building great change, it pays to stay humble.

https://www.jimcollins.com/books/how-the-mighty-fall.html

—————————————————

Below is an article from the Wall Street Journal that explains what happened to the jet.

Boeing needed the launch of its 737 MAX to go quickly and smoothly. This is how it went wrong.

https://www.wsj.com/articles/how-boeings-737-max-failed-11553699239?shareToken=stbdce6a9de3674775859c9f3ac177de9b via @WSJ

3 Likes

Engineers are as invested in the success of a product as any other part of the business. Too often, valid concerns that get overlooked end in terrible outcomes.

1 Like

The engineers had designed two redundant sensors into the jet, from the start. Every 737 Max has them. So they knew they needed redundancy. They also designed a way for pilots to switch off the automation.

But they had not finished the redundancy administration software at the time Boeing wanted to start shipping.

So they were ordered to issue a partial release, using only one sensor, with the intent to upgrade it later.

And Boeing did not want to alter the pilot training process. Since so many pilots were already 737-certified, keeping training unchanged was a huge advantage for faster, lower-cost adoption.

So those Ethiopian and Indonesian pilots had no knowledge of how to override MCAS when it went awry, even though a way had been engineered to do it.

To give the MCAS full authority over flight controls, in conflict with the pilot, in reliance on a single sensor … this is insane. It is evidence of systemic problems in the architectural review process.

Passengers should never have flown for a single hour with that configuration.

This was a significant management error, owing to commercial pressure to compete urgently with Airbus.

Moreover, the whole idea of an expensive angle-of-attack sensor, of which the jet has only two, is sort of ridiculous in modern-day tech.

Every iPhone has a MEMS sensor capable of knowing the angle of attack with respect to earth ground.

These sensors are now tiny and cheap, and very precise. They could have had a hundred redundant sensors for less than one tenth the cost of just one expensive aerospace-style sensor that has now failed twice in the last 5 months.

You could read that array of MEMS sensors and make decisions based on statistical probability. The odds of the majority of those 100 sensors being wrong are so low as to be absurd.
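To put a rough number on that, here is a minimal sketch of what majority voting across a large redundant array buys you. The per-sensor failure rate is an invented assumption for illustration, not real MEMS reliability data.

```python
from math import comb

def p_majority_wrong(n=100, p_fail=0.01):
    """Probability that a strict majority of n independent sensors read wrong
    at the same time, given a per-sensor failure rate p_fail.
    Assumes independent failures; the numbers are purely illustrative."""
    k_majority = n // 2 + 1   # smallest count that constitutes a majority
    return sum(comb(n, k) * p_fail**k * (1 - p_fail)**(n - k)
               for k in range(k_majority, n + 1))

# Even with a pessimistic 1% chance of any single sensor reading wrong,
# the odds of 51 or more of 100 sensors being wrong together are
# on the order of 1e-73 - effectively never.
print(p_majority_wrong(100, 0.01))
```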

This was an avoidable problem. Those people did not need to die.

Btw - Airbus, too, has a two-sensor scheme similar to Boeing’s. This is an aerospace industry culture problem, where commercial tech has far outpaced them, and they stick to old techniques past where they make sense.

When Boeing had fire problems with their lithium-ion batteries on the 787, it was found that they did not have basic thermal runaway protection systems that are standard on every Tesla car.

We assume that the biggest players all do the smartest thing, but it is not so. Their size creates a kind of overconfident blindness, which only gets fixed after tragedy.

2 Likes

Actually, that sensor has failed many more times than twice in the past 5 months. The flight prior to the fatal Lion Air flight had that problem. And there are several entries in the FAA pilot complaint records about the sensor failing on them.

Eyes can be hit by a grain of sand. Sensors can foul or fail. But it is education and engineering and training and knowledge sharing and transparency and diligence and good judgement (on cost vs. risk) and strong regulation and quality checks - that have made the airline industry so safe for so long. But as a society we can observe that every single one of the components I mentioned is deliberately being eroded, in the name of the almighty dollar.

1 Like

Some good news. Although we know as humans we are fallible (we get tired, we don’t see things in repetitive situations), there are specific things we can do to work around those limitations.

Japanese Standard Pointing and Calling

So yes, although it is extra work for a mechanic to mark a bolt that they tightened, there’s a damn good reason to do so. Don’t un-learn all the great innovations in human behaviour that we use to overcome human fallibility.

And yes, we won’t need to point and call and mark bolts - when machines take over these jobs. Funny how we can’t even have machines build cars in the highly controlled factory environment, yet we are unleashing them to drive on highways, and this fall, in cities!

Colinng - to a nontechnical audience of administrators or bureaucrats, Boeing’s doublespeak explanations might’ve confused the picture enough to be temporarily passable.

But to engineers, if you look at what they did, it’s immediately clear that it was crazy. It’s in fact shocking.

There is no way Boeing management would not have heard howls of protest from their engineers.

Boeing top brass had to know. This was totally playing with fire.

There will be congressional hearings into this, and they will dig until they get the full story. And it won’t be pretty. Many high executives will be found culpable, and will lose their jobs. The company will be subject to much greater independent oversight going forward.

Congress should get a scientist with the stature that Richard Feynman had when he investigated the space shuttle loss. He has passed on now, but they need someone with that gravitas and science chops to ride herd on the investigation.

We trusted Boeing.

Then the battery fires made us scratch our heads at how they would accept a design where every cell in the pack had to be perfect to prevent a fire. Still, the plane could keep flying with a blown battery, even if the crew had to land abruptly to deal with smoke. This latest issue is much worse.

Handing the controls to an algorithm, dangling on one sensor - that’s sort of telling you that all decision discipline was lost.

After the pilot reports, Boeing should have themselves insisted on a time-out until they fixed it. But they kept saying it was safe. It wasn’t, and they had to know it wasn’t.

If you had asked us five years ago if we thought Boeing was at risk of imploding, we would have said not likely at all.

But right now, the unthinkable seems like a real possibility. It’s a great company with a great history, but unless there is heroic crisis response leadership now, they may not make it.

When you hand them your body to keep safe, trust is indispensable. It’s everything.

2 Likes

It’s a minor quibble, but the important factor in aircraft stall avoidance is the angle of attack with respect to the surrounding air, not with respect to the earth.

That said, a stall sensor doesn’t have to be expensive. The stall warning system in trainer aircraft is little more than a kazoo that has air drawn across its reed by change in air pressure.

awh_tokyo - yes, the 737’s current AOA (angle of attack) sensors are essentially like weathervanes.

They twist with the direction of airflow. So they can tell if the plane is heading level with the airflow, or at an angle to it, which may lead to a stall.

Usually you’re flying through air or winds that are fairly parallel to the earth, but sometimes there’s a sudden strong updraft. In those cases, the weathervanes - two small vanes at the nose of the plane - can detect this and calculate a risk of stall, so the system can adjust.

But they are small mechanical vanes. If they ice up, get hit by a bird, or otherwise jam their bearings, they will read wrong.

Having just one active vane that might get stuck, and totally relying on that to countermand the pilot and drive the plane to crash into the ground - that is crazy. But that’s what happened. Twice in 5 months.

Modern MEMS sensors know which way the plane is headed relative to the ground. If the software read velocity with respect to the ground, and altitude, and forward-looking radar to see the ground approaching fast, it would know to question the AOA vane. But they weren’t doing any of that. The algorithm just crashed the jet straight into the ground even though the pilot was furiously yanking the controls to go up.

MEMS sensors can give you the direction of the G vector, but other MEMS solid-state pressure sensors can also give you air pressure. If you look at air pressure above and below the wing surfaces, you can have yet another source of data to predict if you’re at risk of stall.
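Just to make the cross-check idea concrete, here is a minimal sketch - with invented thresholds and hypothetical inputs (inertial pitch, radar altimeter, altitude rate) - of how an anti-stall routine could sanity-check a lone AOA vane against independent data before acting on it:

```python
def aoa_reading_plausible(aoa_deg, inertial_pitch_deg, radar_alt_ft, alt_rate_fpm):
    """Cross-check a single AOA vane against independent observables.
    All thresholds are invented for illustration, not real flight values."""
    # A vane claiming a huge nose-up angle while the inertial (MEMS/gyro)
    # attitude says the nose is roughly level is suspect.
    if abs(aoa_deg - inertial_pitch_deg) > 20.0:
        return False
    # If the radar altimeter shows the ground rushing up fast, commanding
    # still more nose-down trim cannot be the right answer.
    if radar_alt_ft < 2500 and alt_rate_fpm < -4000:
        return False
    return True

def anti_stall_should_act(aoa_deg, inertial_pitch_deg, radar_alt_ft,
                          alt_rate_fpm, stall_aoa_deg=14.0):
    """Only trigger nose-down correction if the stall indication survives
    the plausibility checks; otherwise stand down and let the pilot fly."""
    if not aoa_reading_plausible(aoa_deg, inertial_pitch_deg,
                                 radar_alt_ft, alt_rate_fpm):
        return False
    return aoa_deg > stall_aoa_deg

# A stuck vane reading 35 degrees while the jet is diving toward the ground:
print(anti_stall_should_act(35.0, -10.0, 1800, -6000))   # False - stand down
```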

Keeping the old AOA vanes is good redundancy, but not availing themselves of the advances in MEMS sensors to have more data to consult - that is not leading-edge engineering. MEMS sensors have been in iPhones for 11 years. They had time to consider this. But they stuck with tradition to the detriment of their customers.

Even there, though, tradition would have compelled them to consult the two vanes. But they didn’t, and so violated their own tradition too. Add to that the fact that they gave their algorithm the power to override pilot input, and you have a trifecta of f***-ups.

So yes, the plane is flying through a fluid, and the fluid flow might sometimes not be level, but there is no common sense basis to support their multiple wrong decisions.

1 Like

Another way to look at their design is this - if your plane was unlucky enough to fly over an erupting volcano, should your flight control software seek to fly straight into the fluid flow source, and dive into the volcano?

Re the use of MEMS sensors: afaik these are not used in commercial airliners’ navigation systems; at least, Boeing uses ring laser gyros for that.

In addition to the technical f-ups made, the resulting situation is not really that different from a stabilizer trim runaway, which pilots are actually trained to deal with. Just not for this specific kind of “runaway”.

I’m sure there must be a myriad of ways this could have been achieved with simple existing technology. Surely ground speed, relative air speed and an attitude sensor (either a MEMS array or redundant gyros) would be the basis of anti-stall software inputs. This could be further upgraded with wing surface pressure sensors or pitot tube air speed sensors on the top and bottom of the wings to give an indication of air flow and generated lift. Maybe even strain gauges in the wing fore and aft cross members could be an input to show the upwards force of the wings on the aircraft fuselage.
Modern ships use all the basic speed inputs they can get (speed through water, speed over ground, relative wind speed, engine speed, engine shaft torque) to work out the effects of currents and the efficiency of the ship and the propeller through the water. It’s amazing how much fuel you can save sometimes with a simple propeller polish or a hull scrub.
There are even systems now that utilise a strain gauge on an archer’s bow, although not legal for most competitions, to ensure that the force applied to the bow, and then ultimately the arrow, is massively consistent from shot to shot.

bulters and S.S. Webster -

Yes, while any of us can be armchair critics without certain details, this episode seems very starkly defined, based on undisputed gross facts.

Relying entirely on one sensor to decide a life or death issue, when lots of other contrary data was available for that code to see - that seems hard to defend.

If radar shows the plane headed straight for the ground, even a stall is less bad than imminent high speed impact. MCAS was still forcing the nose down on impact. There’s no way to rationalize that algorithmic response as safe. From an engineering perspective, this was a glaring mistake.

Ironically, because of aerospace industry conservatism about possible safety risks, they did not want to add the complexity of weighing the other observables, which ultimately would have made it safer.

In a weird twist, the conservative culture itself may have killed those people.

To be fair, MCAS, when it’s working properly, may indeed save more lives than it hurts. But when it’s compromised, it would be far safer to auto-disengage, rather than force the jet to fly into the ground.

Autopilot in cars today saves more than it kills, but it definitely will self-disengage if it senses faults. That’s in a $50k car, vs. a $200M jet.
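For what that fail-safe behaviour might look like in code, here is a purely illustrative sketch (the sensor names and thresholds are made up) of a routine that refuses to act whenever its two redundant vanes disagree, and says why:

```python
AOA_DISAGREE_LIMIT_DEG = 5.0   # invented threshold, for illustration only
STALL_AOA_DEG = 14.0           # likewise invented

def auto_trim_command(left_vane_deg, right_vane_deg, annunciate=print):
    """One decision cycle of a hypothetical MCAS-like routine.
    If the redundant vanes disagree, disengage and tell the crew instead of
    trusting either one; otherwise return a notional nose-down trim request."""
    if abs(left_vane_deg - right_vane_deg) > AOA_DISAGREE_LIMIT_DEG:
        annunciate("AOA DISAGREE - AUTO TRIM DISENGAGED")
        return 0.0                                    # hand control back to the pilot
    aoa_deg = (left_vane_deg + right_vane_deg) / 2.0
    return -0.5 if aoa_deg > STALL_AOA_DEG else 0.0  # trim units are notional

# Example: a stuck left vane reading 40 degrees against a healthy right vane
print(auto_trim_command(40.0, 3.0))   # prints the disagree warning, returns 0.0
```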

Oh, don’t get me wrong, I absolutely agree that horrible mistakes/decisions were made!

But automatically reverting to a safe failure mode has also been the cause of a quite famous incident in which an Airbus stalled itself all the way down into an ocean. So MCAS auto-reverting to a passive mode might be preferable, but just as dangerous without adequate pilot training.

(Disclaimer: am former airline pilot, 737-900, not the MAX versions)

Hindsight is almost always 20/20, and it’s definitely easy to judge and point fingers when on the outside looking in.

Something I think Waytools themselves are only too aware of.

Lots of time and money can be spent on failure modes and effects analysis (FMEA), but in the end a decision has to be made upon the final output for a given scenario. Unfortunately it’s never that black and white, as was pointed out above. There will always be that statistical anomaly that comes back and bites you.

Unfortunately for Boeing, any mistakes in their industry can have catastrophic outcomes. And for it to happen twice seems, from memory, unheard of in the aviation industry.

Well-trained pilots can save your life, even when faced with horrific equipment failure, through skill and quick-thinking. Thank god we have them.

“Sully” is a famous case of a pilot’s judgment ultimately vindicated as the key to saving those lives.

When adding automated safety systems like MCAS - it is good to follow the Hippocratic oath that doctors pledge -

“First, do no harm.”

It might also help to display something like “MCAS correcting for stall condition”, so the poor pilot can find the MCAS switches and breaker and turn it off.

It’s beginning to sound like it wasn’t that simple unfortunately.

“The crew performed all the procedures repeatedly [that were] provided by the manufacturer but were not able to control the aircraft,”

Ethiopian Airlines Boeing 737 pilots ‘could not stop nosedive’ https://www.bbc.co.uk/news/business-47812225

The Ethiopian pilots had enough training to try turning off MCAS, but not enough to keep it off, and try to regain control purely manually.

They turned MCAS back on and then fought it without success. They did not have enough info and training to be confident to disable it and keep trying manually.

Further, they had little altitude to start with, and not enough time to sort out how to get it under control.

So a compounding of a single-point-failure design flaw with inadequate training to manually override it, and almost no time to figure it out.

Often it comes down to the need for very quick, virtually instinctive reactions when equipment failure presents.

1 Like

Without a doubt, time-sensitive emergency situations require fast thinking. Fast thinking actually involves no real thought process; it’s instinctive reaction to a given situation. That only comes through repetitive training, which in turn gives the operator the confidence to follow what feels instinctive and complete the sequence of events or tasks.

Not having that training leads to a lack of confidence and self-doubt. Being able to start from basic principles and work methodically with slow thinking will get you to the end result, but unfortunately emergency situations are almost never free of time pressure.

There are no words for the situation that Boeing put those poor pilots in.

“Dereliction of duty” and “egregious betrayal of trust” come to mind.