It's not a bug, it's user error (or so they say, via 'the wall'). Actually, while it may not have been a bug in the operating system, it was certainly a bug somewhere in the mission. That the system depended on a hunk of flash that was being shared between critical system code and stored science data seems like a pretty big flaw in the design. Since when does my PC go into a reset loop because I downloaded one too many images from the Web?
Posted by asa at February 3, 2004 08:11 PM
Actually, your machine might go into a reset loop if that last image filled up your hard drive.
Posted by: Bill Ruppert on February 3, 2004 10:10 PM
Europe to send people to Mars within 30 years.
http://www.cbc.ca/stories/2004/02/03/mars_esa040203
Between NASA, the ESA, and China coming up fast with their space program... It's gotta happen.
The line forms behind me. No shoving.
Posted by: Bumfluff on February 3, 2004 11:29 PM
The problem wasn't shared flash, it was shared RAM being used to manage the file system.
One person's bug is another's user error; it's a question of whether the specifications really fit together. I'd say the real problem here was that they didn't do realistic duration testing. The longest workout the software got was in the context of a nine-day full-up team exercise, and I also wonder whether they started with the system burdened by leftover files from a simulated cruise phase. This sounds to me like a job for an automated test harness that could run simulated missions with randomly generated events; it ought to be feasible to do that in this situation.
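To make the test-harness idea concrete, here is a rough Python sketch of a randomized mission simulator. Everything in it is hypothetical: the toy file system, the RAM budget, the per-file cost and the event mix are made-up stand-ins, not anything taken from the actual MER flight software. It only assumes that each stored file costs some RAM for bookkeeping.

```python
import random


class ToyFlashFS:
    """Toy model: every file kept in flash costs a fixed amount of RAM bookkeeping."""

    def __init__(self, ram_budget, ram_per_file):
        self.ram_budget = ram_budget      # hypothetical RAM available to the file system
        self.ram_per_file = ram_per_file  # hypothetical bookkeeping cost per file
        self.files = 0

    def ram_used(self):
        return self.files * self.ram_per_file

    def create(self):
        if self.ram_used() + self.ram_per_file > self.ram_budget:
            raise MemoryError("file system bookkeeping exceeded RAM budget")
        self.files += 1

    def delete(self):
        if self.files > 0:
            self.files -= 1


def run_mission(seed, sols, preload_files=0, events_per_sol=20, p_create=0.6):
    """Run one simulated mission; return the sol on which memory ran out, or None."""
    rng = random.Random(seed)  # seeded so a failing run can be replayed
    fs = ToyFlashFS(ram_budget=1000, ram_per_file=1)
    for _ in range(preload_files):  # leftover files, e.g. from a simulated cruise phase
        fs.create()
    for sol in range(1, sols + 1):
        for _ in range(events_per_sol):
            try:
                if rng.random() < p_create:
                    fs.create()
                else:
                    fs.delete()
            except MemoryError:
                return sol
    return None


def sweep(seeds, sols, **kwargs):
    """Run many seeded missions and collect the ones that ran out of memory."""
    results = {seed: run_mission(seed, sols, **kwargs) for seed in seeds}
    return {seed: sol for seed, sol in results.items() if sol is not None}
```

In this toy model, a nine-sol run that only ever creates files (`p_create=1.0`) finishes cleanly, while the same run started with 900 leftover files, or simply extended to 90 sols, runs out of memory. That is the kind of gap a short team exercise would not expose.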
I used VxWorks on a project ten years ago. These real time operating systems (RTOS) aren't like Windows. They are bare bones software layers that insulate the application programs from the hardware. You often have to compile in specific parameters for the amount of resources dedicated to different functions, such as the number of processes that can run at once. When you exceed some of these limits, you don't get a graceful degradation, but rather a hard crash. I can see the difficulty in a large project like this in making sure that everyone is on the same page as far as these resource limitations, which are often rather technical and abstract.
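As an illustration of a compiled-in limit, here is a small Python sketch of a fixed-size task table. The names `MAX_TASKS`, `TaskTable` and `FatalSystemError` are invented for the example and are not VxWorks APIs. It contrasts the hard failure described above with a variant that reports the problem to the caller.

```python
MAX_TASKS = 4  # stands in for a limit compiled into the RTOS image (hypothetical value)


class FatalSystemError(Exception):
    """Stands in for a hard crash or reset; not a real VxWorks error."""


class TaskTable:
    def __init__(self, max_tasks=MAX_TASKS):
        self.slots = [None] * max_tasks  # fixed size, never grows at run time

    def spawn_or_die(self, name):
        """Exceeding the limit is fatal, as in the behaviour described above."""
        for i, slot in enumerate(self.slots):
            if slot is None:
                self.slots[i] = name
                return i
        raise FatalSystemError(f"no free task slot for {name!r}")

    def try_spawn(self, name):
        """Gentler alternative: return None so the caller can shed load."""
        try:
            return self.spawn_or_die(name)
        except FatalSystemError:
            return None
```

The point of the second method is only that someone has to decide, per resource, what the application does when the limit is hit; the table itself cannot grow either way.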
Posted by: Paul on February 4, 2004 01:37 PM
That's right. Keep in mind also that there's no hard disk on this thing; no effectively infinite place to shove overflow in a virtual-memory scheme, like modern computer operating systems have. When the memory's out, that's all she wrote; the whole system's coming down. You can do escalatingly desperate tricks to recover allocated memory you don't really need, but it will only go so far.
I infer from the ComputerWorld article and other comments that the MERs are probably using an off-the-shelf VxWorks TrueFFS file system with dosFs on top -- see
http://www.fsref.com/pr/vxworks/ref/dosFsLib.html for details,
especially the section towards the end on memory consumption.
I'd be curious to know if the MER software team neglected to use a
separate memory partition for the dosFs and hence just filled up the
system memory partition, or if their error responses to memory
exhaustion were somehow flawed. Hanging up just because dosFs memory
is exhausted is really not reasonable behavior for a system of this
class.
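
To show what I mean by a separate partition, here is a toy Python model. It only tracks byte counts; `Partition`, `write_file` and the `ENOSPC` string are hypothetical names for illustration, not the VxWorks or dosFs interface, and I have no knowledge of how the MER software is actually laid out. The idea is just that file system bookkeeping draws from its own pool, and exhaustion comes back as an error the caller can handle.

```python
class Partition:
    """A fixed-size memory pool that only tracks byte counts (toy model)."""

    def __init__(self, name, size):
        self.name = name
        self.size = size
        self.used = 0

    def alloc(self, nbytes):
        """Return True on success, False if the pool is exhausted."""
        if self.used + nbytes > self.size:
            return False
        self.used += nbytes
        return True

    def free(self, nbytes):
        self.used = max(0, self.used - nbytes)


OK = "OK"
ENOSPC = "ENOSPC"  # hypothetical error code handed back to the caller


def write_file(fs_pool, bookkeeping_bytes):
    """Take file system bookkeeping from fs_pool only; report failure instead of hanging."""
    if not fs_pool.alloc(bookkeeping_bytes):
        return ENOSPC
    return OK
```

With a 200-byte file system pool and a separate 1000-byte system pool, a third 100-byte write fails with `ENOSPC` while the system pool can still hand out memory. The rest of the software can then delete old files or stop collecting data rather than go down with the file system.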