Fault-tolerant agents unlink from mailman on a domain manager
Cause and solution:
This problem is normally caused by a false timeout in one of the mailman processes on the domain manager. During the initialization period immediately following JnextPlan, the "*.msg" files on the domain manager can fill with a backlog of messages from fault-tolerant agents. While mailman processes the messages from one fault-tolerant agent, messages from the other fault-tolerant agents wait in the queue. If the wait exceeds the time interval configured for communications from a fault-tolerant agent, mailman treats those agents as unresponsive and unlinks them, even though they are running normally.
To correct the problem, increase the values of the mm response and mm unlink options in the configuration file ~maestro/localopts on the domain manager. Increase both values together, in small increments of 60 to 300 seconds, until the timeouts no longer occur.
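
As an illustration, the fragment below sketches what one increment might look like in ~maestro/localopts. The starting values of 600 and 960 seconds are assumptions for the example, not values taken from this topic, so check what your own file contains before editing. The 120-second step is an arbitrary choice within the 60 to 300 second range given above.

```
# ~maestro/localopts on the domain manager (illustrative values only)
# Assumed values before the change:
#   mm response = 600
#   mm unlink   = 960
# After raising both options together by 120 seconds:
mm response = 720
mm unlink   = 1080
```

The new values are likely to take effect only after the workstation processes are stopped and restarted; confirm this against the localopts documentation for your version. If agents still unlink after the next JnextPlan, repeat the step with another small increment applied to both options.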