This chapter describes how to replicate a node manager so that it can be recovered when it terminates abnormally.
A node manager starts and terminates servers, and restarts a server when the server fails. Because a node manager itself can also fail, it can be replicated so that a replica takes over when the original node manager fails.
When a node manager is replicated, an active node manager and a standby node manager run together. Only the active node manager processes requests from DAS or servers and manages processes.
The standby node manager only checks the status of the active node manager. If it detects a failure, it concludes that the active node manager has terminated abnormally and takes over as the active node manager. A new standby node manager is then started to monitor the former standby node manager, which is now active. The detailed procedure is as follows:
1. The active and standby node managers start at the same time.
2. The standby node manager monitors the status of the active node manager.
3. The standby node manager detects a failure in the active node manager.
4. The standby node manager prepares a new standby node manager.
5. The existing standby node manager becomes the active node manager and processes requests from servers.
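The procedure above can be illustrated with a small toy model. This is only a sketch of the role changes, not the node manager's actual implementation; the class name NodeManagerPair, the method on_active_failure, and the "nm-N" identifiers are all hypothetical.

```python
from itertools import count


class NodeManagerPair:
    """Toy model of an active/standby pair. All names are hypothetical."""

    def __init__(self):
        self._ids = count(1)
        # The active and standby node managers start at the same time.
        self.active = f"nm-{next(self._ids)}"
        self.standby = f"nm-{next(self._ids)}"

    def on_active_failure(self):
        # The standby has detected a failure in the active node manager.
        # It prepares a new standby, then takes over as the active one.
        new_standby = f"nm-{next(self._ids)}"
        self.active, self.standby = self.standby, new_standby


pair = NodeManagerPair()
print(pair.active, pair.standby)  # nm-1 nm-2
pair.on_active_failure()
print(pair.active, pair.standby)  # nm-2 nm-3
```

After every failover there is again exactly one active and one standby node manager, which is why a replicated node manager keeps running through repeated failures.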
The status of a node manager is checked through a port set in the configuration file. If the port is not set, node manager replication is considered to be disabled.
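The following sketch shows the general idea of a port-based status check. It assumes that a successful TCP connection to the configured port means the active node manager is alive; the real message exchange between node managers is not described here, and the function names are hypothetical.

```python
import socket
from typing import Optional


def replication_enabled(standby_port: Optional[int]) -> bool:
    # If no port is configured, replication is regarded as not used.
    return standby_port is not None


def port_is_reachable(host: str, port: int, timeout: float = 1.0) -> bool:
    # Assumption: "reachable over TCP" stands in for "active node manager
    # is alive". The actual status protocol may differ.
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

A monitoring loop would call port_is_reachable periodically and treat repeated failures as an abnormal termination of the active node manager.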
Whenever a standby node manager becomes the active node manager because the active node manager failed, an additional standby node manager is started to monitor the former standby node manager.
To replicate a node manager, configure the port used to send and receive messages between the active and standby node managers. This port is configured with the standbyPort item as described in "2.3.1. Configuration File". To disable replication, do not configure this port.
The standby node manager starts together with the active node manager, and then monitors the active node manager's status in standby mode.
[nodemanager-1] [NodeManager-0201] The standby node manager is starting.
[nodemanager-1] [NodeManager-0102] Initializing the node manager configuration.
Because a standby node manager does not process server requests, it saves its status information to a separate log file. This log file is created in the same directory as the node manager log file, and its name is the node manager name followed by the '_standby' string. It is used only by standby node managers.
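The naming rule can be sketched as a small helper. The example path and the function name are hypothetical, and the sketch assumes that the standby log keeps the same file extension as the node manager log, which is not stated above.

```python
from pathlib import PurePosixPath


def standby_log_path(active_log_path: str, node_manager_name: str) -> str:
    active = PurePosixPath(active_log_path)
    # Same directory as the node manager log; the file name is the
    # node manager name followed by "_standby".
    # Assumption: the extension of the active log is reused.
    name = f"{node_manager_name}_standby{active.suffix}"
    return str(active.with_name(name))


# Hypothetical location, for illustration only.
print(standby_log_path("/logs/nodemanager-1.log", "nodemanager-1"))
# /logs/nodemanager-1_standby.log
```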
When a standby node manager becomes the active node manager, the transition is recorded in the standby log. From then on, the new active node manager writes to the active log, and the new standby node manager writes to the standby log. As a result, information about server requests and management is recorded in the active log, and information about standby node manager startup and communication with the active node manager is recorded in the standby log.
When a standby node manager starts, it obtains the PID of the active node manager and records it in the log. The active node manager process can be identified by this PID.
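Once the PID has been read from the log, an operator can check whether that process is still running. The sketch below is one way to do this on POSIX systems and is not part of the node manager; the function name is hypothetical. Do not use it on Windows, where os.kill behaves differently and can terminate the process.

```python
import os


def process_is_running(pid: int) -> bool:
    try:
        # Signal 0 sends nothing; it only checks existence and permission.
        os.kill(pid, 0)
    except ProcessLookupError:
        return False  # no such process
    except PermissionError:
        return True   # process exists but belongs to another user
    return True


# Usage: the current process is always running.
print(process_is_running(os.getpid()))  # True
```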
A replicated node manager operates without downtime because the standby node manager replaces an active node manager that terminated abnormally. For the same reason, simply stopping the active node manager does not shut replication down, so use the stopNodeManager script to terminate a replicated node manager. With this script, the active node manager sends a termination message to the standby node manager. The standby node manager closes the connection, logs its own termination, and then terminates cleanly.
If the active node manager is forcibly terminated because of a problem, it may not stop completely because its standby node manager takes over. In that case, terminate the standby node manager first, and then terminate the active node manager. Using the stopNodeManager script is recommended because forced termination may leave incorrect log records.