February 01, 2004

up and down

Daniel Morris, over at From Behind the Wall of Sleep, makes the very good point that many of us have, well, neglected Spirit in our coverage of the MER missions over the last week. He's right.

I certainly haven't been as loud about the terrible flaw or the amazing recovery here at the blog as I have been out in the real world. Maybe that's actually part of it - I've just done enough talking about it outside of the blog that I didn't feel the need to do my talking here.

I've had several long discussions at work and elsewhere about "what went wrong" with Spirit. I work in software development, so the "failure" has been a regular topic of discussion among a couple of my colleagues and me. Was it poor design? Was it just a lack of testing? If it was either of those, then why? Was the schedule just too compressed? Were there scenarios that were too expensive to test? Or did the team knowingly push it beyond its capabilities?

For the last 10 days, we've tossed around a lot of speculation, and in the end it's just that: speculation. NASA/JPL haven't provided enough information to know exactly why this happened. From what they have said, I suspect it happened the way most software bugs happen, through a combination of all the factors I listed above: schedule pressure leads to inadequate testing, and when the software is pushed into scenarios that haven't been fully tested, it fails.

All complex software has bugs, and there are always real-world constraints that prevent 100% test coverage. The good news with the Mars Exploration Rovers is that the entire system was robust enough that the problem could be diagnosed and then repaired or worked around. The redundancy (the "safe mode") and the integrity of the vital communications system have given the rovers the opportunity to continue on after what could have been a fatal bug.

I certainly am pleased (and blogged about it) that Spirit is back on her feet and is expected to fully complete both the nominal and extended missions. I share Daniel's view that "the good folks in Pasadena, of course, under-promised and over-delivered." They all deserve kudos for completing the very difficult simultaneous tasks of getting Spirit up and getting Opportunity down. Go team!

update: Today's press release has all the happy Spirit news.

Posted by asa at February 1, 2004 03:59 PM
Comments

Heck, what could you say? Spirit’s been sitting there like the Tin Man waiting for Dorothy and the Scarecrow to apply some lubrication. As the next few months unfold, I expect to be surprised and delighted in turn by both rovers; it was just Opportunity’s turn…

Posted by: Daniel Morris on February 1, 2004 11:04 PM

You had a quote from the Q&A on January 26: "We exercised the vehicle like we expected to see it on the surface. Longest tests were about 9 days. We're on sol 18. You can't fully simulate that and we found an issue that we were unaware of it because of the accumulation that happens." Right there that points to a lack of testing. Seems like they should have been able to run a software simulation for more than 9 simulated days. How do we know the software's not going to crap out with a new fault in another 2 weeks? I guess we just cross our fingers, nobody's tested it that long.

Posted by: Paul on February 1, 2004 11:51 PM

It's both surprising and fascinating that an apparent software problem could be almost calamitous. Possibly the rigour with which the other systems were tested was not applied to the software: I'll read the 'post mortem' with interest.
Given that no software is bug free and hardware faults may expose unknown weaknesses, perhaps more robust disaster recovery hardware and software would be advisable in future.
Congrats to the engineers for a great recovery! Their 'finest hour'!

Posted by: Julian on February 2, 2004 03:14 AM
