Whenever a managed node boots, the node’s gWLM agent
automatically attempts to rejoin the node to its SRD, providing
high availability. The only configuration steps you need to perform for
this behavior to happen are:

1. Ensure the /etc/rc.config.d/gwlmCtl file on HP-UX or the /etc/sysconfig/gwlmCtl file on
   Linux on each managed node has GWLM_AGENT_START set to 1. You can run the following command on
   each system where gwlmagent is running to make this change for you:

   # /opt/gwlm/bin/gwlmagent --enable_start_on_boot

2. In the same file, you also need GWLM_CMS_START=1 on the system where gwlmcmsd is running.
   However, when you ran /opt/vse/bin/vseinitconfig
   during installation, this change was automatically made.

3. (Optional) Edit the property com.hp.gwlm.node.HA.minimumTimeout in the file
   /etc/opt/gwlm/conf/gwlmagent.properties to set
   the minimum number of seconds that must pass before a managed node
   considers itself separated from its SRD. Set this property to ensure
   that minor network problems do not cause a managed node to prematurely
   consider itself separated.

   gWLM uses this value only if it is larger than 10 multiplied
   by gWLM’s allocation interval. For example, with an allocation
   interval of 15 seconds, a node can go 150 seconds (2.5 minutes) without
   communicating with its SRD before the node’s gWLM agent attempts
   to reconnect with the SRD.
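
The timeout rule for com.hp.gwlm.node.HA.minimumTimeout described above can be sketched as a small shell function. This is an illustrative reading of the text, not gWLM code: effective_timeout is a hypothetical helper, and it assumes the agent falls back to 10 times the allocation interval whenever the configured minimum is not larger than that product.

```shell
#!/bin/sh
# effective_timeout INTERVAL [MINIMUM_TIMEOUT]
# Hypothetical helper (not part of gWLM): prints the number of seconds a
# node waits before it considers itself separated from its SRD, assuming
# the configured minimum is used only when it exceeds 10 * INTERVAL.
effective_timeout() {
    interval=$1
    minimum=${2:-0}              # 0 stands for "property not set"
    floor=$((interval * 10))
    if [ "$minimum" -gt "$floor" ]; then
        echo "$minimum"
    else
        echo "$floor"
    fi
}

effective_timeout 15        # prints 150 (2.5 minutes)
effective_timeout 15 60     # prints 150, because 60 is not larger than 150
effective_timeout 15 300    # prints 300
```

Under this reading, a minimum of 60 seconds has no effect with a 15-second allocation interval; the property matters only once it is set above 150.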

This feature works best when one managed node is lost at a
time or all managed nodes are lost.

How the Automatic Restart Works

When a managed node boots, the gWLM agent (gwlmagent) starts automatically if GWLM_AGENT_START is set to 1 in the file /etc/rc.config.d/gwlmCtl on HP-UX or in the file /etc/sysconfig/gwlmCtl
on Linux. The agent then checks the
file /etc/opt/gwlm/deployed.config to determine its CMS. Next, it
attempts to contact the CMS to have the CMS re-deploy its view of
the SRD. If the CMS cannot be contacted, the SRD in the deployed.config
file is deployed as long as all nodes agree.

In general, when an SRD is disrupted by a node’s
going down or by network communications issues, gWLM attempts to
reform the SRD. gWLM maintains the concept of a cluster for the nodes
in an SRD. In a cluster, one node is a master, and the other nodes
are nonmasters. If the master node loses contact with the rest of
the SRD, the rest of the SRD can continue without it, as a partial
cluster, by unanimously agreeing on a new master. If a nonmaster
loses communication with the rest of the SRD, the resulting partial
cluster continues operation without the lost node. The master simply
omits the missing node until it becomes available again.

NOTE: Attempts to reform SRDs may time out, leaving no SRD
deployed and consequently no management of resource allocations.
If this occurs, stop and start the agents as described in the section
“‘Node Failed to Rejoin SRD on Start-up’ Event” below.
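
Before rebooting a managed node, the two local inputs to the restart sequence described above (the start-on-boot flag and the deployed.config file) can be checked with a short script. The sketch below is an assumption-laden illustration: agent_start_enabled and rejoin_prereqs are hypothetical helpers, the control file is assumed to be a plain shell fragment of NAME=value lines, and a non-empty deployed.config is only assumed to be what the agent needs to find its CMS. The script cannot tell whether the CMS is reachable or whether the other nodes agree on the SRD.

```shell
#!/bin/sh
# agent_start_enabled FILE
# Succeeds if FILE, read as a shell fragment, sets GWLM_AGENT_START to 1.
# The subshell keeps the sourced variables out of the caller's environment.
agent_start_enabled() {
    (
        unset GWLM_AGENT_START
        . "$1" >/dev/null 2>&1
        [ "${GWLM_AGENT_START:-0}" = "1" ]
    )
}

# rejoin_prereqs CTL_FILE DEPLOYED_CONFIG
# Hypothetical check: reports whether the agent is set to start on boot
# and whether a non-empty deployed.config exists. Returns 1 if either
# prerequisite is missing.
rejoin_prereqs() {
    status=0
    if agent_start_enabled "$1"; then
        echo "ok: agent starts on boot"
    else
        echo "missing: GWLM_AGENT_START=1 in $1"
        status=1
    fi
    if [ -s "$2" ]; then
        echo "ok: $2 is present"
    else
        echo "missing: $2"
        status=1
    fi
    return "$status"
}

# Example (HP-UX paths from the text; use /etc/sysconfig/gwlmCtl on Linux):
#   rejoin_prereqs /etc/rc.config.d/gwlmCtl /etc/opt/gwlm/deployed.config
```

Pass the control file as an absolute path, because the shell's dot command searches PATH for names that contain no slash.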

Related Events

You can configure the following SIM events regarding this
automatic restart feature:

- Node Failed to Rejoin SRD on Start-up
- SRD Reformed with Partial Set of Nodes

For information on enabling and viewing these events, refer
to gWLM’s “Configure Events” menu. You can then view these events using the Event Lists item
in the left pane of SIM. The following sections explain how to handle some of the events.

“Node Failed to Rejoin SRD on Start-up” Event

If you see this event:

1. Stop the gwlmagent on each managed node in the affected SRD:

   # /opt/gwlm/bin/gwlmagent --stop

2. Restart the agent on each of those managed nodes:

   # /opt/gwlm/bin/gwlmagent

3. Verify the agent rejoined the SRD by monitoring the
   Shared Resource Domain View in SIM or by using the gwlm monitor command.

If the problem persists, check the files /var/opt/gwlm/gwlmagent.log.0
and /var/opt/gwlm/gwlm/gwlmcmsd.log.0 for additional diagnostic
messages.
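
When an SRD has many managed nodes, the stop-then-restart steps above can be scripted from one administration host. The following is a minimal sketch, assuming root ssh access to every managed node; restart_agents, run, and the DRY_RUN variable are hypothetical names introduced for illustration, and only the two gwlmagent commands come from the procedure itself.

```shell
#!/bin/sh
# run COMMAND [ARG...]
# Prints the command when DRY_RUN is set; otherwise executes it.
run() {
    if [ -n "${DRY_RUN:-}" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# restart_agents NODE...
# Stops gwlmagent on every listed node first, then starts it on every
# node, following the order of the manual procedure. Assumes ssh access.
restart_agents() {
    for node in "$@"; do
        run ssh "$node" /opt/gwlm/bin/gwlmagent --stop ||
            echo "warning: stop failed on $node" >&2
    done
    for node in "$@"; do
        run ssh "$node" /opt/gwlm/bin/gwlmagent ||
            echo "warning: start failed on $node" >&2
    done
}

# Preview the commands without touching any node (host names are examples):
#   DRY_RUN=1; restart_agents nodeA nodeB
```

Running with DRY_RUN set first is a cheap way to confirm the node list before any agent is stopped. Afterward, verify the result in the Shared Resource Domain View or with gwlm monitor, as in step 3.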

“SRD Communication Issue” / “SRD Reformed with Partial Set of Nodes” Events

NOTE: Reforming with a partial set of nodes requires a minimum
of three managed nodes in the SRD.
“SRD Communication Issue” events are
not enabled by default. To see these events, configure your events
in SIM through the VSE Management menu bar using Tools -> Global Workload Manager -> Events.

If you have an SRD containing n nodes
and you get n - 1 of the “SRD Communication
Issue” events but no “SRD Reformed with Partial Set of Nodes” events
within 5 minutes (assuming an allocation interval of 15 seconds)
of the first “SRD Communication Issue” event, you may need to:

1. Stop the gwlmagent on each managed node in the affected SRD:

   # /opt/gwlm/bin/gwlmagent --stop

2. Restart the agent on each of those managed nodes:

   # /opt/gwlm/bin/gwlmagent
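
The condition above (n - 1 communication events, no reform event, 5 minutes elapsed) can be expressed as a simple predicate. This sketch is illustrative only: reform_stalled is a hypothetical helper, the event counts are assumed to be read by hand from the SIM event list, and the 300-second window is the figure the text gives for a 15-second allocation interval.

```shell
#!/bin/sh
# reform_stalled NODES COMM_EVENTS REFORMED_EVENTS ELAPSED_SECONDS
# Succeeds when an SRD of NODES nodes has produced NODES - 1
# "SRD Communication Issue" events, no "SRD Reformed with Partial Set of
# Nodes" event, and at least 300 seconds have passed since the first
# communication event. The counts are supplied by the operator; this
# function does not query SIM.
reform_stalled() {
    nodes=$1 comm=$2 reformed=$3 elapsed=$4
    window=300   # 5 minutes, stated for a 15-second allocation interval
    [ "$comm" -eq $((nodes - 1)) ] &&
        [ "$reformed" -eq 0 ] &&
        [ "$elapsed" -ge "$window" ]
}

# Four nodes, three communication events, no reform event after 320 seconds:
if reform_stalled 4 3 0 320; then
    echo "stop and restart gwlmagent on every node in the SRD"
fi
```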

Manually Clearing an SRD

If gWLM is unable to reform an SRD, you can manually clear
the SRD, as described below.

Clearing an SRD of A.02.50.00.x (or later) agents

The command discussed below is an advanced command for clearing
an SRD. The recommended method for removing a host from
management is typically the gwlm undeploy command.

Starting with A.02.50.00.x agents, you can manually clear
an SRD with the following command:

   # gwlm reset --host=host

where host specifies the host with the SRD to be cleared.

If the above command does not work, follow the procedure given
in the next section.

Clearing an SRD of agents of any version

The following procedure clears an SRD regardless of the version
of the agents in the SRD:

1. Delete the deployed.config file on each managed node:

   # rm -f /etc/opt/gwlm/deployed.config

2. Force an undeploy of the SRD (named SRD below) to ensure the CMS and the managed nodes agree
   on the SRD’s state. Run the following command on the CMS:

   # /opt/gwlm/bin/gwlm undeploy --srd=SRD --force

3. Stop the gwlmagent daemon on each managed node:

   # /opt/gwlm/bin/gwlmagent --stop

4. Start the gwlmagent daemon on each managed node:

   # /opt/gwlm/bin/gwlmagent
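
The any-version procedure above can be collected into one script that runs on the CMS. The sketch below assumes root ssh access from the CMS to each managed node; clear_srd, run, and DRY_RUN are hypothetical names used for illustration, and the commands themselves are exactly those listed in the procedure.

```shell
#!/bin/sh
# run COMMAND [ARG...]
# Prints the command when DRY_RUN is set; otherwise executes it.
run() {
    if [ -n "${DRY_RUN:-}" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# clear_srd SRD NODE...
# Run on the CMS. Performs the four steps in order: delete deployed.config
# on each node, force the undeploy locally, stop every agent, then start
# every agent. Assumes ssh access to the managed nodes.
clear_srd() {
    srd=$1
    shift
    for node in "$@"; do
        run ssh "$node" rm -f /etc/opt/gwlm/deployed.config
    done
    run /opt/gwlm/bin/gwlm undeploy --srd="$srd" --force
    for node in "$@"; do
        run ssh "$node" /opt/gwlm/bin/gwlmagent --stop
    done
    for node in "$@"; do
        run ssh "$node" /opt/gwlm/bin/gwlmagent
    done
}

# Preview the full sequence first (SRD and host names are examples):
#   DRY_RUN=1; clear_srd mysrd nodeA nodeB
```

Because the procedure deletes state on every node, previewing with DRY_RUN before a real run is worthwhile. The sketch does not stop on errors, so review the output of each step.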

NOTE: If the gWLM CMS and agent disagree about whether an
SRD is deployed or undeployed, you can use the --force option with the gwlm deploy or gwlm undeploy commands.