Introduction to FSW Design Philosophy

The FSW group's design philosophy is based on a few key concepts:

Risk Management
Several design decisions limit the ability of the FSW, however aberrant its behavior, to endanger the mission:
- active monitoring
  Each processor is monitored by a hardware watchdog circuit. If the circuit is not "refreshed" on a timely basis, it forces a reboot of the processor. In support of this function, the FSW Watchdog task periodically surveys the other tasks; if any task fails to respond properly, no refresh message is issued.
- limited responsibility
  Although the FSW controls some environmental settings (e.g., temperature), it manages only fine control. All safety limits are maintained by the Spacecraft hardware and software.
- prophylactic rebooting
  If the FSW encounters a serious processing error, it does not try to continue. Instead, it reboots the processor involved, saving a memory image for diagnostic evaluation. This limits the effects of transient faults such as "bit flips" in RAM.
- self-protecting hardware
  The LAT hardware is designed to be self-protecting; nothing that the FSW does should be able to harm it. Similarly, the spacecraft and GBM are not supposed to honor FSW requests that would put them at risk.
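
The active-monitoring scheme described above can be sketched as follows. This is a minimal illustration only, not the flight implementation: the class names (`HardwareWatchdog`, `WatchdogTask`), the tick-based timeout, and the boolean "responder" callbacks are all assumptions made for the example.

```python
from typing import Callable, Dict


class HardwareWatchdog:
    """Hypothetical stand-in for the hardware watchdog circuit."""

    def __init__(self, timeout_ticks: int) -> None:
        self.timeout_ticks = timeout_ticks
        self.remaining = timeout_ticks
        self.reboots = 0

    def refresh(self) -> None:
        # A timely refresh restarts the countdown.
        self.remaining = self.timeout_ticks

    def tick(self) -> None:
        # Called once per time step; an expired countdown forces a reboot.
        self.remaining -= 1
        if self.remaining <= 0:
            self.reboots += 1
            self.remaining = self.timeout_ticks


class WatchdogTask:
    """Hypothetical model of the survey-then-refresh logic."""

    def __init__(self, refresh_hardware: Callable[[], None]) -> None:
        self._refresh_hardware = refresh_hardware
        self._tasks: Dict[str, Callable[[], bool]] = {}

    def register(self, name: str, responder: Callable[[], bool]) -> None:
        self._tasks[name] = responder

    def survey(self) -> bool:
        """Poll every task; refresh the hardware only if all respond properly."""
        for responder in self._tasks.values():
            try:
                ok = responder()
            except Exception:
                ok = False
            if not ok:
                # Withhold the refresh; the hardware countdown keeps running.
                return False
        self._refresh_hardware()
        return True
```

The point of the design is that the software never decides to reboot in this path; it only stops refreshing, and the hardware does the rest. A hung or misbehaving task therefore cannot prevent the reboot.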
Flexibility
Both the LAT instrument and the FSW are designed for extreme flexibility:
- instrument configuration
  The LAT instrument contains more than two million bits of configuration data, stored in ~100,000 configuration "registers". These can be used to disable sensors, compensate for changing component characteristics, etc. Various sets of configuration settings can be used to achieve desired scientific results, manage environmental demands, etc.
  A fresh configuration is loaded before each observation session and dumped at the beginning and end of the session. This gives ground-based engineers the ability to analyze how the configuration (including any mid-session changes) might have affected the observed data.
- software modification
  Aside from the Primary Boot Code, the FSW can be updated or replaced (e.g., from ground-based telemetry or another processor). This can be used to compensate for hardware failures, repair late-surfacing software bugs, or institute entirely new behavior.
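
As a rough illustration of how the start-of-session and end-of-session configuration dumps could be used on the ground, the sketch below compares two dumps and reports which registers changed. The dictionary representation, the function name `diff_dumps`, and the register names in the usage example are hypothetical; the actual dump format is not described here.

```python
from typing import Dict, Optional, Tuple

# Hypothetical representation: register name -> register value.
Registers = Dict[str, int]


def diff_dumps(
    start: Registers, end: Registers
) -> Dict[str, Tuple[Optional[int], Optional[int]]]:
    """Return {register: (start_value, end_value)} for registers that differ.

    A register missing from one dump is reported with None on that side.
    """
    changed: Dict[str, Tuple[Optional[int], Optional[int]]] = {}
    for name in sorted(set(start) | set(end)):
        before = start.get(name)
        after = end.get(name)
        if before != after:
            changed[name] = (before, after)
    return changed


# Usage with made-up register names and values:
start_dump = {"TKR_THRESH_0": 0x20, "CAL_GAIN_3": 5}
end_dump = {"TKR_THRESH_0": 0x20, "CAL_GAIN_3": 7}
mid_session_changes = diff_dumps(start_dump, end_dump)
```

An empty result would indicate that the configuration was stable for the whole session; a non-empty one points to the registers whose mid-session changes may have affected the observed data.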
Hardware Redundancy
Many portions of the LAT hardware are duplicated, allowing errors to be detected and spare hardware to be swapped in. For example, there are spare EPU and SIU processors, a spare GASU, and hardware error detection and correction for all memory.
Parallel Development
The FSW group's suite of code management tools improves its ability to work in parallel. An engineer can develop and test code changes using production-quality (but limited-capability) versions of other programmers' software. As new features become available, they can be "published" in Development and/or Production versions.

These design decisions greatly simplify the lives of the FSW engineers. Because the engineers do not have to concern themselves with complicating factors such as mission-critical issues or fail-soft performance in the face of failures, they are free to use simpler designs. This, in turn, leads to speedier development and more reliable software.