A job remains in "exec" status after JnextPlan but is not running
After running JnextPlan you notice that a job has remained in "exec" status, but is not being processed.
Cause and solution:
This error scenario is possible if a job completes its processing at a fault-tolerant agent just before JnextPlan is run. The circumstances in which the error occurs are as follows:
- A job completes processing
- The fault-tolerant agent marks the job as "succ" in its current Symphony file
- The fault-tolerant agent prepares and sends a job status changed event (JS) and a job termination event (JT), informing the master domain manager of the successful end of the job
- At this point JnextPlan is started on the master domain manager
- JnextPlan starts by unlinking its workstations, including the one that has just sent the JS and JT events. The events are thus not received, and wait in a message queue at an intermediate node in the network.
- JnextPlan carries the job forward into the next Symphony file, and marks it as "exec", because the last information it had received from the workstation was the Launch Job Event (BL).
- JnextPlan relinks the workstation
- The fault-tolerant agent receives the new Symphony file and checks for jobs in the "exec" status.
- It then tries to correlate these jobs with running processes but finds no match, so it does not update the job status
- The master domain manager receives the Completed Job Event that was waiting in the network and marks the carried forward job as "succ" and so does not send any further messages in respect of the job
- Next time JnextPlan is run, the job will be treated as completed and will not figure in any further Symphony files, so the situation will be resolved. However, in the meantime, any dependent jobs will not have been run. If you are running JnextPlan with an extended frequency (for example once per month), this might be a serious problem.
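The race described above can be sketched as a small simulation. The queue, agent, and master below are simplified stand-ins for illustration only, not TWS internals; the job name is made up:

```python
from collections import deque

# Simplified stand-ins for the components involved in the race.
network_queue = deque()          # events in transit between agent and master
agent_status = {"JOB1": "succ"}  # agent's Symphony file: job already finished
master_status = {}               # master's view, rebuilt by JnextPlan

# 1. The agent queues the JT event announcing successful completion.
network_queue.append(("JT", "JOB1", "succ"))

# 2. JnextPlan unlinks the workstations: the event stays queued, undelivered.
#    The job is carried forward as "exec", since the last event the master
#    received was the launch.
master_status["JOB1"] = "exec"

# 3. The agent receives the new Symphony file and checks "exec" jobs against
#    running processes; there is no matching process, so nothing is updated.
running_processes = set()
if "JOB1" not in running_processes:
    pass  # no match: the agent leaves the status alone

# 4. The link is restored and the queued event finally reaches the master,
#    which marks the carried-forward job as "succ" and sends nothing more.
event_type, job, status = network_queue.popleft()
master_status[job] = status

print(master_status["JOB1"])  # "succ", but only after the delay
```

The key point the sketch makes concrete is step 3: because the agent's correlation is against live processes, a job that already finished can never be re-matched, so only the delayed event can resolve its status.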
There are two possible solutions:
- Leave JnextPlan to resolve the problem
  If there are no jobs dependent on this one, leave the situation to be resolved by the next run of JnextPlan.
- Change the job status locally to "succ"
  Change the job status as follows:
  - Check the job's stdlist file on the fault-tolerant agent to confirm that it did complete successfully.
  - Issue the following command on the fault-tolerant agent:
    conman "confirm <job>;succ"
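The check-then-confirm sequence can be scripted. The sketch below is illustrative only: it scans a job's stdlist text for a zero exit status before printing the confirm command. The exact marker-line format and the job name are assumptions; verify both against your own stdlist files before relying on this:

```python
# Illustrative helper, not a TWS utility: decide whether it is safe to
# issue "conman confirm" by looking for a zero exit status in the job's
# stdlist output. The "= Exit Status" line format is an assumption.
def completed_successfully(stdlist_text: str) -> bool:
    for line in stdlist_text.splitlines():
        if line.startswith("= Exit Status") and line.rstrip().endswith(": 0"):
            return True
    return False

# Hypothetical excerpt from a stdlist file for job MYCPU#MYSCHED.MYJOB.
sample = """\
= JOB       : MYCPU#MYSCHED.MYJOB
= Exit Status           : 0
= Job Ended
"""

if completed_successfully(sample):
    # Safe to confirm on the fault-tolerant agent:
    print('conman "confirm MYCPU#MYSCHED.MYJOB;succ"')
```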
To prevent a recurrence of this problem, take the following steps:
- Edit the JnextPlan script
- Locate the following command:
  conman "stop @!@;wait ;noask"
- Replace this command with individual stop commands for each workstation:
  conman "stop <workstation> ;wait ;noask"
  Start with the most distant nodes in the network, follow with their parents, and so on, ending with the master domain manager last. In this way, a message placed in a workstation's forwarding queue, either by its own job monitoring processes or by a communication from a lower level, has time to be forwarded at least to the level above before the workstation itself is stopped.
- Save the modified JnextPlan script
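The ordered shutdown described above can be generated mechanically. This sketch uses a made-up topology (the workstation names and the parent-to-children map are assumptions, not your network) and walks it in post-order, so the farthest nodes are stopped first and the master domain manager last:

```python
# Hypothetical topology: master -> domain managers -> fault-tolerant agents.
topology = {
    "MASTER": ["DM_A", "DM_B"],
    "DM_A": ["FTA1", "FTA2"],
    "DM_B": ["FTA3"],
}

def stop_order(root: str) -> list[str]:
    """Post-order traversal: children (most distant nodes) before parents."""
    order = []
    for child in topology.get(root, []):
        order.extend(stop_order(child))
    order.append(root)
    return order

# Emit one stop command per workstation, bottom-up, master last.
for ws in stop_order("MASTER"):
    print(f'conman "stop {ws} ;wait ;noask"')
```

Because each workstation is stopped only after everything below it, events already queued at a lower level can drain at least one level upward before their forwarder shuts down, which is exactly the property the manual edit of JnextPlan is meant to guarantee.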