Introduction to FSW Memory Leak Handling


The FSW group has adopted a strategy that makes memory leaks, if not impossible, at least very traceable. The rules of the game are:

  1. All resource allocation is performed at task initialization.

    Memory allocation is the most common form of resource allocation, but there are certainly others: semaphores, file handles, timer entries, etc. A task (or in most cases, a collection of tasks) allocates the resources it needs at initialization time. It does not, in general, share these resources with other tasks.

    If a task runs out of memory, one of two things has happened: either it failed to allocate enough memory initially or it has an internal memory leak. The first problem should only affect performance and is easy to adjust, if need be.

  2. Allocated memory is never shared.

    This rule is designed to limit the problem to the collection of tasks owning the resource; i.e., internal bugs should stay internal. Developers should be able to track down where their resources are going without having to consider the system as a whole.

    This avoids the problem of a rogue task leaking memory like a sieve and causing some other (well-behaved) task to get the short end of the stick. The downside of this philosophy is less efficient use of memory. In a real-time, embedded environment, it is a trade-off we are willing to make.

Allocators

To facilitate this philosophy, PBS (Processor Basic Services) provides a couple of memory managers: FPA (Fixed Packet Allocator) and RNG (Ring Buffer Allocator). Others will be added, as needed. Typically, during task initialization, one uses malloc to allocate a chunk of memory which is then handed over to FPA or RNG to be managed. On task shutdown, these resources are returned. Given that startup and shutdown are rare and well-defined events, memory leaks via the malloc route should also be easy to track down. Admittedly, in this case, the real trick is realizing that there is a problem.
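
As a rough illustration of the pattern just described, the sketch below uses malloc once at task initialization and hands the chunk to a small fixed-size packet pool. This is not the PBS FPA interface; packet_pool, pool_init, pool_get, pool_put, and pool_destroy are hypothetical names invented for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical fixed-size packet pool; NOT the real PBS FPA API. */
typedef struct packet_pool {
    void  *base;        /* chunk obtained from malloc at task init   */
    void  *free_list;   /* list threaded through the free packets    */
    size_t free_count;  /* packets currently available               */
    size_t total;       /* packets the pool was created with         */
} packet_pool;

/* Task initialization: the only place malloc is called. */
int pool_init(packet_pool *pool, size_t packet_size, size_t count)
{
    /* Each free packet stores a link pointer, so it must hold one. */
    assert(packet_size >= sizeof(void *));
    assert(packet_size % sizeof(void *) == 0);

    pool->base = malloc(packet_size * count);
    if (pool->base == NULL) return -1;

    pool->free_list = NULL;
    for (size_t i = 0; i < count; i++) {
        void *pkt = (char *)pool->base + i * packet_size;
        *(void **)pkt = pool->free_list;   /* push onto the free list */
        pool->free_list = pkt;
    }
    pool->free_count = count;
    pool->total      = count;
    return 0;
}

/* Run time: take a packet; NULL means this task's pool is exhausted. */
void *pool_get(packet_pool *pool)
{
    void *pkt = pool->free_list;
    if (pkt != NULL) {
        pool->free_list = *(void **)pkt;
        pool->free_count--;
    }
    return pkt;
}

/* Run time: give a packet back to the pool it came from. */
void pool_put(packet_pool *pool, void *pkt)
{
    *(void **)pkt = pool->free_list;
    pool->free_list = pkt;
    pool->free_count++;
}

/* Task shutdown: release the chunk and report packets never returned. */
size_t pool_destroy(packet_pool *pool)
{
    size_t outstanding = pool->total - pool->free_count;
    free(pool->base);
    pool->base = NULL;
    pool->free_list = NULL;
    return outstanding;
}
```

Because the pool belongs to a single task, a nonzero result from pool_destroy (or a pool_get that starts returning NULL) points at a leak inside that task rather than somewhere else in the system.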

Drivers, the Exception to the Rule

For performance reasons, drivers must loan their memory out to (or borrow it from) other tasks. FSW tries two tactics to ameliorate this problem:

  1. In any packet of memory that is being lent out, reserve a word to identify the loanee. One can then scan the packets that the driver allocated to see who has them. (The driver always maintains a record of the memory it owns.)

  2. Try to avoid having the driver do driver-specific allocation.

The latter approach is preferable, but not always practical. The LCB driver never allocates any packets, but it provides other tasks with pointers to memory locations in a shared ring buffer. So, sharing is indeed going on. Unfortunately, for efficiency reasons, we must live with this.
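
The first tactic listed above, reserving a word in each lent packet to identify the loanee, might look roughly like the following. The packet layout, the task ID values, and the function names (driver_lend, driver_return, driver_count_loans) are assumptions made for this sketch, not the layout or API of any actual FSW driver.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define OWNER_DRIVER  0u   /* hypothetical ID meaning "not on loan" */
#define PACKET_COUNT  8
#define PAYLOAD_BYTES 32

/* Hypothetical packet layout: one reserved word names the loanee. */
typedef struct loan_packet {
    uint32_t loanee;                 /* task currently holding the packet */
    uint8_t  payload[PAYLOAD_BYTES];
} loan_packet;

/* The driver's record of every packet it owns (zeroed, so all are home). */
static loan_packet driver_packets[PACKET_COUNT];

/* Lend a free packet to task_id; NULL if every packet is already out. */
loan_packet *driver_lend(uint32_t task_id)
{
    assert(task_id != OWNER_DRIVER);
    for (size_t i = 0; i < PACKET_COUNT; i++) {
        if (driver_packets[i].loanee == OWNER_DRIVER) {
            driver_packets[i].loanee = task_id;
            return &driver_packets[i];
        }
    }
    return NULL;
}

/* The loanee hands the packet back to the driver. */
void driver_return(loan_packet *pkt)
{
    pkt->loanee = OWNER_DRIVER;
}

/* Scan the driver's packets to see how many a given task is holding. */
size_t driver_count_loans(uint32_t task_id)
{
    size_t held = 0;
    for (size_t i = 0; i < PACKET_COUNT; i++) {
        if (driver_packets[i].loanee == task_id) held++;
    }
    return held;
}
```

When the driver runs dry, a scan like driver_count_loans shows which task is sitting on the packets.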

This philosophy is (or soon will be) more successful in the 1553 driver. The driver will read the message from hardware and, after verifying the message's integrity, call a task-supplied memory-allocation routine. If the allocation routine succeeds, the driver will copy the message into the task's memory, then dispatch the message to the task.

If the task servicing a given APID has squandered its memory (or gotten behind in its processing), messages to it may get discarded. Tasks servicing other APIDs, however, will not be impeded. Thus, the 1553 implementation recovers some of the original philosophy.
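
A minimal sketch of this arrangement follows, assuming a small table of handlers indexed by APID. None of these names (driver_register, driver_deliver, demo_task, and so on) come from the actual 1553 driver; they are hypothetical, and the message-integrity check is omitted for brevity.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define MAX_APID 16   /* hypothetical table size, for the sketch only */

typedef void *(*alloc_fn)(void *ctx, size_t len);
typedef void  (*dispatch_fn)(void *ctx, void *msg, size_t len);

typedef struct apid_handler {
    alloc_fn    alloc;      /* task-supplied allocation routine       */
    dispatch_fn dispatch;   /* hands the copied message to the task   */
    void       *ctx;        /* task-private state                     */
    unsigned    discarded;  /* messages dropped for this APID         */
} apid_handler;

static apid_handler handlers[MAX_APID];

void driver_register(unsigned apid, alloc_fn alloc,
                     dispatch_fn dispatch, void *ctx)
{
    assert(apid < MAX_APID);
    handlers[apid] = (apid_handler){ alloc, dispatch, ctx, 0 };
}

/* Returns 1 if the message was delivered, 0 if it was discarded. */
int driver_deliver(unsigned apid, const void *hw_msg, size_t len)
{
    if (apid >= MAX_APID || handlers[apid].alloc == NULL) return 0;

    apid_handler *h = &handlers[apid];
    void *copy = h->alloc(h->ctx, len);     /* memory comes from the task */
    if (copy == NULL) {                     /* task is out of memory      */
        h->discarded++;
        return 0;
    }
    memcpy(copy, hw_msg, len);
    h->dispatch(h->ctx, copy, len);
    return 1;
}

/* Example task with a single message slot that stays busy once filled. */
typedef struct demo_task {
    uint8_t slot[64];
    int     slot_busy;
    size_t  received;
} demo_task;

void *demo_alloc(void *ctx, size_t len)
{
    demo_task *t = ctx;
    if (t->slot_busy || len > sizeof t->slot) return NULL;
    t->slot_busy = 1;
    return t->slot;
}

void demo_dispatch(void *ctx, void *msg, size_t len)
{
    demo_task *t = ctx;
    (void)msg;
    (void)len;
    t->received++;   /* a real task would queue the message for processing */
}
```

If one demo_task never frees its slot, only messages for its APID are counted as discarded; a second task registered on another APID keeps receiving.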

Bad Pointers

Given that we are working in a real-time environment, there is no way to totally guard against bad pointers. Any piece of code that first checks a pointer for integrity and then uses it in a non-interlocked fashion always has a hole between the checking and the usage. Doing it "correctly" would be tantamount to writing a single-threaded piece of code.

This is not to imply that anything less than 100% effective is worthless; just don't believe that there is a "magic bullet". Better checking by users would help, but the cure could be worse than the disease:

    ptr = access_control_block (); 
    if (ptr == NULL) return BAD_POINTER;
    
Although this has some value, the area it covers (in the space of possible errors) is small. Unless you have some reason to believe that access_control_block can return NULL in its normal course of doing business, this check is next to worthless: any other bad value it returns is the result of some overwrite or corruption problem that this test provides no protection against. A better check is to plant some integrity information in the object being accessed. Something like:

    ptr = access_control_block ();
    if (ptr->self_pointer != ptr) return BAD_POINTER;

is much better. If ptr is bad, or the structure it references has been corrupted, this test is far more likely to notice, although it is still imperfect.
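
To make the self-pointer idea concrete, here is a small self-contained sketch. The control_block layout, the status codes, and the function names are assumptions for illustration; only the self_pointer comparison itself comes from the text above.

```c
#include <assert.h>
#include <stddef.h>

typedef enum { STATUS_OK = 0, BAD_POINTER = -1 } status;

/* Hypothetical control block carrying its own integrity information. */
typedef struct control_block {
    struct control_block *self_pointer;  /* set to the block's own address */
    int                   value;
} control_block;

/* Plant the integrity information when the block is created. */
void control_block_init(control_block *cb, int value)
{
    assert(cb != NULL);
    cb->self_pointer = cb;
    cb->value        = value;
}

/* Verify the block before using it. */
status control_block_read(const control_block *ptr, int *out)
{
    /* Kept only so the sketch never dereferences NULL; it rules out
       exactly one bad value. */
    if (ptr == NULL) return BAD_POINTER;

    /* The stronger test: a stray pointer or an overwritten block is
       unlikely to contain its own address in this field. */
    if (ptr->self_pointer != ptr) return BAD_POINTER;

    *out = ptr->value;
    return STATUS_OK;
}
```

A block whose self_pointer has been overwritten, or a bitwise copy sitting at a different address, fails the comparison. A pointer that lands on unmapped memory can still fault before the test runs, which is one reason the check remains imperfect.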

There is an old expression, "don't check for errors that you don't plan on handling". There is some truth in this statement, but it is misleading. One should separate error detection from error correction. Detection is a worthwhile activity, even if the only response you have is to halt the system.

At least you can stop the problem as close to its origin as possible. Allowing the system to continue is like spreading a disease. Sooner or later, some innocent victim is going to use a bad piece of information, possibly corrupting one of their data structures. Tracking this mess back to its origins is, well, a mess.