Configuring for end-to-end scheduling with fault tolerance capabilities in a SYSPLEX environment
In a configuration with a controller and no standby controllers, define the end-to-end server work directory in a file system mounted under a system-specific HFS or a system-specific zFS.
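As an illustration, a system-specific mount for the work directory could be defined in the BPXPRMxx parmlib member along these lines. The data set name and mount point are examples only; adapt them to your installation naming conventions:

```
MOUNT FILESYSTEM('OMVS.&SYSNAME..TWS.WRKDIR')  /* example data set name   */
      TYPE(ZFS)                                /* or HFS                  */
      MODE(RDWR)
      MOUNTPOINT('/&SYSNAME./var/TWS/inst')    /* system-specific path    */
```

The &SYSNAME. system symbol resolves differently on each system, which keeps the work directory local to the system where the end-to-end server runs.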
Then configure the Byte Range Lock Manager (BRLM) server in distributed form (see the BRLM considerations that follow). In this way the server is not affected by the failure of other systems in the sysplex.
Having a shared HFS or zFS in a sysplex configuration means that all file systems are available to all systems participating in shared HFS or zFS support. With shared HFS or zFS support there is no I/O performance reduction for file systems mounted read-only (R/O). However, the cross-system coupling facility (XCF) communication required for shared HFS or zFS might affect the response time of read/write (R/W) file systems shared in a sysplex. For example, assume that a user on system SYS1 issues a read request to a file system owned R/W by system SYS2. With shared HFS or zFS support, the read request is sent through an XCF messaging function. After SYS2 receives the message, it gathers the requested data from the file and returns the data using the same request message.
In many cases, when accessing data on the system that owns a file system, the file I/O time is only the path length through the buffer manager to retrieve the data from the cache. In contrast, file I/O to a shared HFS or zFS from a client system that does not own the mount requires additional path length, plus the time spent in the XCF messaging function. Increased XCF message traffic can contribute to performance degradation. For this reason, it is recommended that system files be owned by the system where the end-to-end server runs.
In a configuration with an active controller and several standby controllers, make sure that all the related end-to-end servers running on the different systems in the sysplex have access to the same work directory.
On z/OS® systems, the shared zFS capability is available: all file systems mounted by a system participating in shared zFS are available to all participating systems. When allocating the work directory in a shared zFS, you can define it either in a file system mounted under the system-specific zFS or in a file system mounted under the sysplex root. A system-specific file system becomes unreachable when its system is not active. To take advantage of the takeover process, define the work directory in a file system mounted under the sysplex root and defined as AUTOMOVE.
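For example, a work directory eligible for takeover could be mounted under the sysplex root with the AUTOMOVE attribute, so that ownership moves to another participating system if the owning system fails. The data set name and path are examples only:

```
MOUNT FILESYSTEM('OMVS.TWS.WRKDIR')   /* example data set name        */
      TYPE(ZFS)
      MODE(RDWR)
      MOUNTPOINT('/var/TWS/inst')     /* path under the sysplex root  */
      AUTOMOVE                        /* move ownership on failure    */
```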
The BRLM server can be configured in either of two ways:
- With a central BRLM server running on one member of the sysplex and managing locks for all processes running in the sysplex.
- In a distributed form, where each system in the sysplex has its own BRLM server responsible for handling lock requests for all regular files in file systems that are mounted and owned locally (see APARs OW48204 and OW52293).
Stop the end-to-end server before closing a system in either of the following cases:
- The work directory is owned by the system to be closed. The df -v command in OMVS displays the owners of the mounted file systems.
- The system hosts the central BRLM server. The console command DISPLAY OMVS,O displays the name of the system where the BRLM runs.
If the BRLM server becomes unavailable, distributed BRLM is implemented. In this case the end-to-end server needs to be stopped only if the system that owns the work directory is stopped.
To minimize the risk of filling up the IBM Workload Scheduler internal queues while the server is down, schedule the closure of the system when the workload is low.
A separate file system data set is recommended for each stdlist directory mounted in R/W on /var/TWS/inst/stdlist, where inst varies depending on your configuration.
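A minimal sketch of allocating and mounting a dedicated zFS for the stdlist directory follows. The aggregate name, size, and path are assumptions; on some z/OS levels the aggregate must also be formatted before it can be mounted:

```
zfsadm define -aggregate OMVS.TWS.STDLIST -cylinders 100 10
zfsadm format -aggregate OMVS.TWS.STDLIST
/usr/sbin/mount -t ZFS -f OMVS.TWS.STDLIST /var/TWS/inst/stdlist
```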
When you calculate the size of the file system, consider that you need 10 MB for each of the following files: Intercom.msg, Mailbox.msg, pobox/tomaster.msg, and pobox/CPUDOMAIN.msg.
You need 512 bytes for each record in the Symphony, Symold, Sinfonia, and Sinfold files. Count one record for each CPU, schedule, and job or recovery job.
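For example, with a hypothetical network of 100 CPUs, 500 schedules, and 2000 jobs (including recovery jobs), the sizing works out as follows:

```
(100 + 500 + 2000) records x 512 bytes          = 1,331,200 bytes, about 1.3 MB per file
4 plan files (Symphony, Symold, Sinfonia,
    Sinfold) x 1.3 MB                           = about 5.3 MB
4 message files x 10 MB (Intercom.msg,
    Mailbox.msg, pobox/tomaster.msg,
    pobox/CPUDOMAIN.msg)                        = 40 MB
Total                                           = about 45 MB, plus space for stdlist and trace files
```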
You can specify the number of days that trace files are kept on the file system by using the TRCDAYS parameter of the TOPOLOGY statement.
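For example, a TOPOLOGY statement that keeps trace files for 14 days might look like the following sketch. The directory values are placeholders for your installation:

```
TOPOLOGY BINDIR('/usr/lpp/TWS')      /* installation binaries (example) */
         WRKDIR('/var/TWS/inst')     /* end-to-end work directory       */
         TRCDAYS(14)                 /* keep trace files for 14 days    */
```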