Configuring for end-to-end scheduling with fault tolerance capabilities in a SYSPLEX environment

In a configuration with a controller and no stand-by controllers, define the end-to-end server work directory in a file system mounted under the system-specific ZFS. Then, configure the Byte Range Lock Manager (BRLM) server in a distributed form (see following considerations about BRLM). In this way the server will not be affected by the failure of other systems in the sysplex.

In a configuration with an active controller and several stand-by controllers, make sure that all the related end-to-end servers running on the different systems in the sysplex have read-write access to the same work directory.

The shared ZFS capability can be used: all file systems that are mounted by a system participating in shared ZFS are available to all participating systems. When allocating the work directory in a shared ZFS, you can decide to define it in a file system mounted under the system-specific ZFS or in a file system mounted under the sysplex root. A system-specific file system becomes unreachable if the system is not active. So, to make good use of the takeover process, define the work directory in a file system mounted under the sysplex root and defined as automove.

The Byte Range Lock Manager (BRLM) locks some files in the work directory. The BRLM can be implemented:
  • With a central BRLM server running on one member of the sysplex and managing locks for all processes running in the sysplex.
    Note: This is no longer supported once all systems in a sysplex are at the z/OS® V1R6 or later level.
  • In a distributed form, where each system in the sysplex has its own BRLM server responsible for handling lock requests for all regular files in a file system which is mounted and owned locally (refer to APARs OW48204 and OW52293).
If the system where the BRLM runs experiences a scheduled or unscheduled outage, all locks held under the old BRLM are lost; to preserve data integrity, further locking and I/O on any opened files is prevented until the files are closed and reopened. Moreover, any process locking a file is terminated (make sure that your OS/390® service level includes PTF UW75787 for V2R9 systems, or UW75786 for V2R10 systems).
To avoid this kind of error in the end-to-end server, before starting a scheduled shutdown procedure for a system, you must stop the end-to-end server if either or both of the following conditions occur:
  • The work directory is owned by the closing system
    • The df -v command on OMVS displays the owners of the mounted file systems
  • The system hosts the central BRLM server
    • The console command DISPLAY OMVS,O can be used to display the name of the system where the BRLM runs. If the BRLM server becomes unavailable, then the distributed BRLM is implemented. In this case the end-to-end server needs to be stopped only if the system that owns the work directory is stopped.
The server can be restarted after a new system in the sharing has taken the ownership of the file system and/or a new BRLM is established by one of the surviving systems.
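
The two stop conditions above can be expressed as a small decision check. The following Python fragment is only an illustrative sketch and is not part of the product: the function name, its parameters, and the system names are hypothetical, and the values are assumed to be obtained manually from the df -v and DISPLAY OMVS,O output.

```python
from typing import Optional


def must_stop_server(closing_system: str,
                     workdir_owner: str,
                     central_brlm_host: Optional[str] = None) -> bool:
    """Return True if the end-to-end server should be stopped before
    the scheduled shutdown of closing_system.

    workdir_owner     -- system that owns the work directory file system
                         (hypothetical input, read from df -v)
    central_brlm_host -- system hosting the central BRLM server, or None
                         when the distributed BRLM is in use
                         (hypothetical input, read from DISPLAY OMVS,O)
    """
    # Condition 1: the work directory is owned by the closing system.
    owns_workdir = closing_system == workdir_owner
    # Condition 2: the closing system hosts the central BRLM server.
    hosts_central_brlm = (central_brlm_host is not None
                          and closing_system == central_brlm_host)
    return owns_workdir or hosts_central_brlm


if __name__ == "__main__":
    # SYSA, SYSB, and SYSC are placeholder system names.
    print(must_stop_server("SYSB", workdir_owner="SYSA"))
    print(must_stop_server("SYSB", workdir_owner="SYSA",
                           central_brlm_host="SYSB"))
```

With the distributed BRLM (no central host), only the ownership of the work directory matters, which matches the description above.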

To minimize the risk of filling up the IBM Workload Scheduler internal queues while the server is down, schedule the closure of the system when the workload is low.

A separate file system data set is recommended for each stdlist directory mounted in R/W on /var/TWS/inst/stdlist, where inst varies depending on your configuration.

When you calculate the size of a file, consider that you need 10 MB for each of the following files:
  • Intercom.msg
  • Mailbox.msg
  • pobox/tomaster.msg
  • pobox/CPUDOMAIN.msg.

You need 512 bytes for each record in the Symphony, Symold, Sinfonia, and Sinfold files. Consider one record for each CPU, schedule, and job or recovery job.
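
As a rough aid, the sizing figures above can be combined into a single estimate. The following Python fragment is a minimal sketch under stated assumptions, not a product formula: it assumes that 10 MB means 10 x 1024 x 1024 bytes, that there is one file of each listed message type, and that each of the four plan files holds the same number of records. The function name and the sample counts are hypothetical.

```python
MSG_FILE_BYTES = 10 * 1024 * 1024   # assumption: 10 MB = 10 * 1024 * 1024 bytes
MSG_FILES = ("Intercom.msg", "Mailbox.msg",
             "pobox/tomaster.msg", "pobox/CPUDOMAIN.msg")
RECORD_BYTES = 512                  # bytes for each record, as stated above
PLAN_FILES = ("Symphony", "Symold", "Sinfonia", "Sinfold")


def estimate_bytes(cpus: int, schedules: int, jobs: int) -> int:
    """Estimate the space needed for the message files and plan files.

    jobs counts both jobs and recovery jobs. Trace files and stdlist
    output are NOT included in this estimate.
    """
    # One record for each CPU, schedule, and job or recovery job.
    records = cpus + schedules + jobs
    # Assumption: every plan file holds the same number of records.
    plan_bytes = records * RECORD_BYTES * len(PLAN_FILES)
    msg_bytes = MSG_FILE_BYTES * len(MSG_FILES)
    return plan_bytes + msg_bytes


if __name__ == "__main__":
    # Hypothetical workload: 10 CPUs, 100 schedules, 1000 jobs.
    total = estimate_bytes(cpus=10, schedules=100, jobs=1000)
    print(f"{total} bytes (about {total / (1024 * 1024):.1f} MB)")
```

Treat the result as a lower bound and add space for trace files, which depends on the TRCDAYS setting.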

You can specify the number of days that the trace files are kept on the file system using the parameter TRCDAYS in the TOPOLOGY statement.