HP Integrity Essentials Global Workload Manager User's Guide: A.03.00.00 > Chapter 5 Additional Configuration and Administration Tasks

Automatic Restart of gWLM’s Managed Nodes in SRDs (High Availability)


Whenever a managed node boots, the node’s gWLM agent automatically attempts to rejoin the node to its SRD, providing high availability. To enable this behavior, perform the following configuration steps:

  1. Ensure the /etc/rc.config.d/gwlmCtl file on each managed node has GWLM_AGENT_START set to 1. You can run the following command on each system where gwlmagent is running to make this change for you:

    # /opt/gwlm/bin/gwlmagent --enable_start_on_boot

    In the same file, you also need GWLM_CMS_START=1 on the system where gwlmcmsd is running. However, running /opt/vse/bin/vseinitconfig during installation makes this change automatically.

  2. (Optional) Edit the property

    com.hp.gwlm.node.HA.minimumTimeout

    in the file /etc/opt/gwlm/conf/gwlmagent.properties to set the minimum number of seconds that must pass before a managed node considers itself separated from its SRD. Set this property to ensure that minor network problems do not cause a managed node to prematurely consider itself separated.

    gWLM uses this value only if it is larger than 10 multiplied by gWLM’s allocation interval. For example, with an allocation interval of 15 seconds, a node can go 2.5 minutes without communicating with its SRD before the node’s gWLM agent attempts to re-connect with the SRD.
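The timeout rule in step 2 can be illustrated with a little shell arithmetic. This is only a sketch of the rule as described above, not gWLM's own code; the function name effective_timeout is hypothetical, and both arguments are whole numbers of seconds.

```shell
#!/bin/sh
# Hypothetical helper (not part of gWLM): models the rule described above.
# $1 = value of com.hp.gwlm.node.HA.minimumTimeout, in seconds
# $2 = gWLM allocation interval, in seconds
effective_timeout() {
    min_timeout=$1
    interval=$2
    floor=$((interval * 10))          # 10 multiplied by the allocation interval
    if [ "$min_timeout" -gt "$floor" ]; then
        echo "$min_timeout"           # the property is used only when it is larger
    else
        echo "$floor"
    fi
}

effective_timeout 60 15     # prints 150 (2.5 minutes): 60 is not larger than 10 * 15
effective_timeout 300 15    # prints 300: the property exceeds 150
```

With a 15-second allocation interval, a minimumTimeout of 150 or less has no effect under this rule.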

This feature works best when one managed node is lost at a time or all managed nodes are lost.

NOTE: If a vpar is borrowing cores from other vpars when it loses contact with its SRD, those borrowed cores may be separated from the SRD. If the vpar may be down for an extended time, check that the SRD has reformed without that vpar and that it has enough cores to meet its commitments. If not, try using vparmodify to reclaim some of the cores. (With the vpar down, you will not be able to modify it locally, and only some versions of HP-UX Virtual Partitions allow you to easily modify a remote vpar.)

Similarly, if an npar has several active cores (due to Instant Capacity) when it loses contact with its SRD, you may have to manually size the npar to reclaim those cores for npars still in the SRD. Refer to the Instant Capacity documentation regarding such issues.

How the Automatic Restart Works

When a managed node boots, the gWLM agent (gwlmagent) starts automatically if GWLM_AGENT_START is set to 1 in the file /etc/rc.config.d/gwlmCtl. The agent then checks the file /etc/opt/gwlm/deployed.config to determine its CMS. Next, it attempts to contact the CMS to have the CMS re-deploy its view of the SRD. If the CMS cannot be contacted, the SRD in the deployed.config file is deployed as long as all nodes agree.
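The two preconditions in the paragraph above can be checked before a planned reboot. The following is a minimal sketch: the function name check_agent_boot_config and its return codes are hypothetical, and it assumes GWLM_AGENT_START appears in gwlmCtl as a plain NAME=value assignment. Treating a missing deployed.config as "nothing to rejoin" is an interpretation of the text above, not documented behavior.

```shell
#!/bin/sh
# Hypothetical pre-reboot check (not part of gWLM).
# $1 = path to gwlmCtl         (default: /etc/rc.config.d/gwlmCtl)
# $2 = path to deployed.config (default: /etc/opt/gwlm/deployed.config)
# Returns 0 if both preconditions hold, 1 if the agent is not enabled
# at boot, 2 if the deployed.config file is absent.
check_agent_boot_config() {
    ctl_file=${1:-/etc/rc.config.d/gwlmCtl}
    deployed_file=${2:-/etc/opt/gwlm/deployed.config}

    # Assumes a plain assignment such as GWLM_AGENT_START=1 (optionally quoted).
    if ! grep -Eq '^[[:space:]]*GWLM_AGENT_START="?1"?([[:space:]]|#|$)' "$ctl_file" 2>/dev/null; then
        echo "GWLM_AGENT_START is not set to 1 in $ctl_file"
        return 1
    fi

    # Assumption: without this file the agent has no recorded CMS to contact.
    if [ ! -f "$deployed_file" ]; then
        echo "$deployed_file not found"
        return 2
    fi

    echo "agent is enabled at boot and $deployed_file is present"
    return 0
}
```

Run on a managed node with no arguments, the function inspects the default paths named in this section.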

In general, when an SRD is disrupted by a node’s going down or by network communications issues, gWLM attempts to reform the SRD. gWLM maintains the concept of a cluster for the nodes in an SRD. In a cluster, one node is a master, and the other nodes are nonmasters. If the master node loses contact with the rest of the SRD, the rest of the SRD can continue without it, as a partial cluster, by unanimously agreeing on a new master. If a nonmaster loses communication with the rest of the SRD, the resulting partial cluster continues operation without the lost node. The master simply omits the missing node until it becomes available again.

NOTE: Attempts to reform SRDs may time out, leaving no SRD deployed and consequently no management of resource allocations. If this occurs, see the VSE Management Software Release Notes and follow the actions suggested in the section titled “Data Missing in Real-time Monitoring”.

Related Events

You can configure the following SIM events regarding this automatic restart feature:

  • Node Failed to Rejoin SRD on Start-up

  • SRD Reformed with Partial Set of Nodes

  • SRD Communication Issue

For information on enabling and viewing these events, refer to gWLM’s “Configure Events” menu.

You can then view these events using the Event Lists item in the left pane of SIM.

The following sections explain how to handle some of the events.

“Node Failed to Rejoin SRD on Start-up” Event

If you see this event:

  1. Restart the gwlmagent on each managed node in the affected SRD:

    # /opt/gwlm/bin/gwlmagent --restart

  2. Verify the agent rejoined the SRD by monitoring the Shared Resource Domain View in SIM or by using the gwlm monitor command.

  3. If the problem persists, check the files /var/opt/gwlm/gwlmagent.log.0 and /var/opt/gwlm/gwlmcmsd.log.0 for additional diagnostic messages.

“SRD Communication Issue” and “SRD Reformed with Partial Set of Nodes” Events

NOTE: Reforming with a partial set of nodes requires a minimum of three managed nodes in the SRD.

NOTE: “SRD Communication Issue” events are not enabled by default. To see these events, configure your events in SIM through the VSE Management menu bar using Tools->Global Workload Manager->Events.

If you have an SRD containing n nodes and you get n - 1 of the “SRD Communication Issue” events but no “SRD Reformed with Partial Set of Nodes” events within 5 minutes (assuming an allocation interval of 15 seconds) of the first “SRD Communication Issue” event, you may need to restart the gwlmagent on each managed node in the affected SRD:

# /opt/gwlm/bin/gwlmagent --restart
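The decision rule above can be written as a small shell function. This is only a sketch: the function name restart_advised is hypothetical, and the caller is assumed to have counted the events by hand (for example, from the SIM Event Lists), counting only those “SRD Reformed with Partial Set of Nodes” events that arrived within the window after the first “SRD Communication Issue” event. The sketch also assumes that seeing more than n - 1 communication events should be treated the same as seeing exactly n - 1.

```shell
#!/bin/sh
# Hypothetical helper (not part of gWLM or SIM).
# $1 = n, the number of nodes in the SRD
# $2 = number of "SRD Communication Issue" events seen
# $3 = number of "SRD Reformed with Partial Set of Nodes" events seen
#      within the window after the first communication event
restart_advised() {
    nodes=$1
    comm_events=$2
    reformed_events=$3
    # Assumption: "n - 1 events" is read as "at least n - 1".
    if [ "$comm_events" -ge $((nodes - 1)) ] && [ "$reformed_events" -eq 0 ]; then
        echo yes
    else
        echo no
    fi
}

restart_advised 4 3 0    # prints yes: 3 of 4 nodes reported, SRD did not reform
restart_advised 4 3 1    # prints no: the SRD reformed with a partial set of nodes
```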

Manually Clearing an SRD

If gWLM is unable to reform an SRD, you can manually clear the SRD, as described below.

Clearing an SRD of A.02.50.00.x (or later) agents

The command discussed below is an advanced command for clearing an SRD. The recommended way to remove a host from management is typically the gwlm undeploy command.

Starting with A.02.50.00.x agents, you can manually clear an SRD with the following command:

# gwlm reset --host=host

where host specifies the host with the SRD to be cleared.

If the above command does not work, follow the procedure given in the next section.

Clearing an SRD of agents of any version

The following procedure clears an SRD regardless of the version of the agents in the SRD:

  1. Delete the deployed.config file on each managed node:

    # rm -f /etc/opt/gwlm/deployed.config

  2. Force an undeploy of the SRD (named SRD below) to ensure the CMS and the managed nodes agree on the SRD’s state. Run the following command on the CMS:

    # /opt/gwlm/bin/gwlm undeploy --srd=SRD --force

  3. Restart the gwlmagent daemon on each managed node:

    # /opt/gwlm/bin/gwlmagent --restart

NOTE: If the gWLM CMS and agent disagree about whether an SRD is deployed or undeployed, you can use the --force option with the gwlm deploy or gwlm undeploy commands.
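Because the any-version procedure touches every managed node and the CMS in a fixed order, it can help to write out the full command sequence before running anything. The sketch below only prints that sequence and executes nothing; the function name print_clear_srd_plan and the example SRD and host names are hypothetical.

```shell
#!/bin/sh
# Hypothetical dry-run helper (not part of gWLM): prints the commands of the
# any-version clearing procedure in order, without running them.
# $1 = SRD name; remaining arguments = managed node host names
print_clear_srd_plan() {
    srd=$1
    shift
    # Step 1: delete deployed.config on each managed node.
    for host in "$@"; do
        echo "on $host: rm -f /etc/opt/gwlm/deployed.config"
    done
    # Step 2: force an undeploy of the SRD from the CMS.
    echo "on CMS: /opt/gwlm/bin/gwlm undeploy --srd=$srd --force"
    # Step 3: restart the gwlmagent daemon on each managed node.
    for host in "$@"; do
        echo "on $host: /opt/gwlm/bin/gwlmagent --restart"
    done
}

# Example names only; substitute your own SRD and hosts.
print_clear_srd_plan mysrd nodeA nodeB
```

For two managed nodes, the plan lists five commands: two deletions, the forced undeploy on the CMS, then two agent restarts.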
© 2004-2007 Hewlett-Packard Development Company, L.P.