Overcoming the Perils of Production:
Part II of Systems Management For The Distributed Object Environment

David S. Newman
Technium, Inc.
Published by Distributed Object Computing Magazine, April 1997

A CORBA-based system must overcome many challenges and obstacles in order to survive in the production environment. Object Request Broker Systems Management (OSM) is the insurance policy that an application must have to ensure a successful lifespan in production. OSM refers to the ability to control, monitor, configure and recover application resources that are essential to the proper functioning of the distributed object environment. Organizations that are contemplating the deployment of large-scale CORBA systems will require the support provided by an OSM facility. Doing otherwise is the information systems equivalent of sky-diving without a reserve parachute.

 

This article is the second in a two-part series that examines the subject of ORB systems management. The first article discussed the OSM lifecycle service for managing CORBA method servers. This second article focuses on the OSM utility services, which further extend the organization's ability to manage and control the production environment. This series is based upon the OSM facility that was designed and developed at Wells Fargo Bank.

 

Server Management is Essential

A managed server is a CORBA method server whose behavior can be supervised, controlled and monitored. Each managed server is given a unique user-definable identity and a set of dynamically changeable properties. The OSM managed server lifecycle service makes it possible for method servers to be launched, terminated, disabled and monitored in an orderly manner. Controlling groups of managed servers is the most fundamental requirement for supporting the production environment. Without the ability to control the numerous method server processes that abound in a large-scale distributed object environment, a firm is virtually helpless to remediate problems rapidly and effectively.

 

The need for OSM Utility Services

Although server management is a valuable and necessary function, other services are also critically needed to effectively manage a large multi-platform distributed object environment. The following list of questions suggests some of the additional demands placed on an ORB systems management facility.

These questions represent only the tip of the iceberg of issues surrounding the production management of the distributed object environment. The OSM utility services were developed to chip away at these challenges by providing value-added auxiliary support. For each service, the following sections describe its intent, the motivation for using it, the OSM solution to the problem, and, briefly, the OSM architecture behind it.

 

Server Recovery Service

 

Intent: Provide a facility that automatically recovers failed method servers.

 

Motivation: When method servers abnormally terminate, users are unable to perform work on the system. Operators are required to continuously monitor the system and manually restart servers.

 

Solution: The OSM managed server recovery service automatically and rapidly restarts failed servers, restoring service to users and buying enough time for systems personnel to diagnose and resolve the underlying problem. Operators are not forced to divert their attention from other critical systems in order to jump-start failed method servers.

 

Architecture: The OSM managed server recovery service is part of a larger framework called managed server maintenance, which comprises server recovery, state resynchronization, and server health evaluation. The managed server recovery service determines whether the number of running servers has fallen below the minimum threshold required to provide adequate service to clients. If it has, the service performs recovery by launching additional servers until the desired number is running. For example, if only two out of ten Customer managed servers are running, the server recovery service will immediately start eight more. The reason for taking this proactive approach, rather than just letting the ORB autostart servers, is that normal server startup time might be unacceptably long, especially when servers are attempting to connect to remote backend systems or external service providers. To avoid the resulting delays or timeout conditions that would negatively impact clients, it is important to ensure that servers are running and available for service before client requests arrive. (Figure 1).
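The recovery rule described above can be illustrated with a minimal sketch. It is not the OSM implementation: the ServerGroup record, the recover function and the stand-in launcher are hypothetical names, and a single desired count is assumed to serve as both the threshold and the target.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class ServerGroup:
    """Hypothetical record for one group of managed servers."""
    name: str
    desired: int                                        # servers that should be running
    running: List[str] = field(default_factory=list)    # ids of servers known to be up


def recover(group: ServerGroup, launch: Callable[[str], str]) -> int:
    """Launch servers until the group reaches its desired count.

    Returns the number of servers that were started.
    """
    shortfall = max(0, group.desired - len(group.running))
    for _ in range(shortfall):
        group.running.append(launch(group.name))
    return shortfall


def make_launcher() -> Callable[[str], str]:
    """Stand-in for a real process launcher; it only hands out new ids."""
    counter = {"next": 0}

    def launch(group_name: str) -> str:
        counter["next"] += 1
        return f"{group_name}-new-{counter['next']}"

    return launch


# The example from the text: two of ten Customer servers are running.
customer = ServerGroup("Customer", desired=10, running=["Customer-1", "Customer-2"])
started = recover(customer, make_launcher())
print(started, len(customer.running))  # 8 10
```

Running recover a second time on the same group starts nothing, which is the behavior a periodic maintenance pass would need.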


 

Managed Server Health Check Service

 

Intent: Provide a facility that checks the heartbeat of each running server in the system.

 

Motivation: Monitoring the health of servers in an operatorless environment allows the management system to diagnose errors before clients are impacted.

 

Solution: The OSM managed server health check service continuously polls servers to determine how well they are functioning. If a group of servers is reporting failures, the OSM event notification service can quickly alert operators so that they can resolve the problem.

 

Architecture: A heartbeat check is performed by sending each managed server an isAlive request. Each server responds to the isAlive request by conducting an internal diagnostic check and returning a disposition. This diagnostic check can also include sending requests to backend systems which certain method servers internally encapsulate, in order to validate the integrity of communication links and health of legacy applications. Based upon the response to isAlive, the health check service might attempt to resolve problems by recycling the affected server or by sending an alarm to operators. The health check service provides a valuable function by detecting and resolving problems independent of client usage of the system. (Figure 1).
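One polling pass of such a heartbeat check might look like the sketch below. The Disposition values, the health_check function, and the rule that maps a disposition to recycling or to an alarm are all assumptions made for illustration; the article does not specify them.

```python
from enum import Enum
from typing import Callable, Dict, List


class Disposition(Enum):
    """Hypothetical outcomes of a server's internal diagnostic check."""
    OK = "ok"
    DEGRADED = "degraded"   # e.g. a backend link failed its check
    FAILED = "failed"


def health_check(
    servers: Dict[str, Callable[[], Disposition]],
    recycle: Callable[[str], None],
    alarm: Callable[[str], None],
) -> None:
    """Poll each server once and act on the disposition it returns."""
    for server_id, is_alive in servers.items():
        try:
            disposition = is_alive()
        except Exception:
            # No answer at all is treated the same as a failed check.
            disposition = Disposition.FAILED
        if disposition is Disposition.FAILED:
            recycle(server_id)      # assumed policy: restart the server
        elif disposition is Disposition.DEGRADED:
            alarm(server_id)        # assumed policy: tell an operator


def unreachable() -> Disposition:
    """Simulates a server that does not respond to isAlive."""
    raise ConnectionError("no response")


recycled: List[str] = []
alarmed: List[str] = []
health_check(
    {
        "Customer-1": lambda: Disposition.OK,
        "Customer-2": lambda: Disposition.DEGRADED,
        "Customer-3": unreachable,
    },
    recycle=recycled.append,
    alarm=alarmed.append,
)
print(recycled, alarmed)  # ['Customer-3'] ['Customer-2']
```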

 

State Resynchronization Service

 

Intent: Provide a facility that resynchronizes invalid states of managed servers.

 

Motivation: When states of managed servers no longer reflect their actual conditions, the system is no longer manageable. Regaining control over server states is a prerequisite for regaining control over the entire system.

 

Solution: The state resynchronization service detects instances of state incongruity, attempts to remedy them by resetting states to their correct values, and terminates processes if necessary. The server recovery service is then able to start new servers.

 

Architecture: The OSM state resynchronization service is responsible for resynchronizing state information between OSM, the ORB and the operating system. For example, if the current state of a method server is 'running', but it is no longer registered with the ORB agent, nor is it running as a process, state resynchronization must reset the state to 'available'.

 

The state resynchronization service may also initiate corrective events based upon the particular state discrepancy. For example, if the operating system indicates that the method server process is running, but the implementation is no longer registered to the ORB agent, the process will be terminated. (Figure 1).
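The two examples above amount to a small decision table over three observations, which the following sketch spells out. The Observation record and the action names are hypothetical, and the state that results after a stray process is terminated is assumed to be 'available', since the text does not say.

```python
from typing import NamedTuple, Tuple


class Observation(NamedTuple):
    """What the three sources say about one managed server (hypothetical)."""
    osm_state: str          # state recorded by OSM, e.g. "running"
    orb_registered: bool    # is the implementation registered with the ORB agent?
    process_alive: bool     # does the operating system show the process?


def resynchronize(obs: Observation) -> Tuple[str, str]:
    """Return (corrected_state, action) for one managed server."""
    if obs.osm_state == "running" and not obs.orb_registered:
        if not obs.process_alive:
            # Recorded as running, but neither registered nor alive.
            return ("available", "none")
        # Process exists but is not registered with the ORB agent: terminate it.
        # Assumption: the state is then reset to "available" as well.
        return ("available", "terminate_process")
    # Consistent, or a discrepancy this sketch does not cover.
    return (obs.osm_state, "none")


print(resynchronize(Observation("running", False, False)))  # ('available', 'none')
print(resynchronize(Observation("running", False, True)))   # ('available', 'terminate_process')
print(resynchronize(Observation("running", True, True)))    # ('running', 'none')
```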

 

Event Logging Service

 

Intent: Provide a means to capture descriptive information about specific events that have occurred within the system.

 

Motivation: Audit requirements dictate that specific information about business events be captured so that audit trails can be recreated upon demand. In addition, there is demand to generate management reports recapping business activity for specific time periods, error activity, and system performance statistics.

 

Solution: The OSM event logging service captures detailed information about each and every method event that has been invoked within the distributed object environment. Event logging makes it possible to obtain a consolidated view of all methods that have participated within the context of a broader business process. Event data is aggregated into a centralized data warehouse. Management reports can then be generated from the volumes of data that are stored.

 

Architecture: Each managed server is required to log critical events by sending log requests to the EventLog interface. An EventLog is a collection class that contains event log items. Event log items contain a required static set of data items, as well as an optional variable portion of data items. Event logging captures important audit and performance information about events performed by method servers, clients accessing the distributed object environment, and by systems and operations personnel who have altered the configuration of the system. The event logging service includes components that log different types of events, as well as query or access event history. Glossary classes describe the data that is logged. This capability supports a truly universal interface, making it possible for any user to log any type of event. The event logging service is a critical facility that is a prerequisite for other utility services, such as event notification.

 

The event logging service also logs server lifecycle history and exception conditions that have occurred within each server. This information is critically needed when performing problem determination. (Figure 2).
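The shape of an event log item, with its required fixed portion and optional variable portion, can be sketched as follows. The field names and the log and query methods are assumptions for illustration only; the actual EventLog interface is not shown in this article.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Dict, List


@dataclass
class EventLogItem:
    """One logged event: a fixed required part plus a free-form optional part."""
    server_id: str                    # required: who logged the event
    event_type: str                   # required: e.g. "lifecycle", "exception"
    timestamp: datetime               # required: when it was logged
    details: Dict[str, str] = field(default_factory=dict)  # optional variable part


class EventLog:
    """In-memory stand-in for the collection of event log items."""

    def __init__(self) -> None:
        self._items: List[EventLogItem] = []

    def log(self, server_id: str, event_type: str, **details: str) -> EventLogItem:
        item = EventLogItem(server_id, event_type,
                            datetime.now(timezone.utc), dict(details))
        self._items.append(item)
        return item

    def query(self, event_type: str) -> List[EventLogItem]:
        """Return the logged items of one event type, in logging order."""
        return [item for item in self._items if item.event_type == event_type]


event_log = EventLog()
event_log.log("Customer-1", "lifecycle")
event_log.log("Customer-1", "exception", reason="backend timeout")
exceptions = event_log.query("exception")
print(len(exceptions), exceptions[0].details)  # 1 {'reason': 'backend timeout'}
```

Keeping the variable portion as a simple name-to-value mapping is one way to let any caller log any type of event without changing the interface.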


 

Event Notification Service

 

Intent: Provide a means to immediately notify operators when specific events occur in the system.

 

Motivation: The overriding need is to rapidly resolve system failures as soon as they occur. The longer that implementation failures of system components go undetected, the longer that service levels to users continue to degrade.

 

Solution: The OSM event notification service immediately detects the implementation failure and alerts computer operations to the nature of the problem, even before clients might realize there is something wrong. Systems support personnel can be quickly contacted to resolve the problem and minimize the impact on users.

 

Architecture: In a firm's distributed object environment, implementation failures and other critical errors may go undetected for an unacceptably long period of time. This is especially evident in an operatorless environment such as the Internet, where customers may experience problems long before operators become aware of them. It is, therefore, necessary to put in place an early warning notification system that alerts operators to problems in real time, in order to trigger immediate problem determination and resolution.

 

The event notification service focuses on automatically initiating specific notification events when certain other predefined events occur in the distributed object environment. The event notification service associates an event with a specific notification process. For example, an implementation error that is being logged can be associated with a specific action, such as the issuance of an alarm to an operator or perhaps the sending of a message to an automated operations application for automatic resolution.

 

In the event notification service, an event item is pushed to an EventChannel class. The OSM EventChannel class was designed to be a subclass of the CORBA Services PushConsumer class. The role of PushConsumer is to receive notification that an event occurred at a Supplier in order to trigger an appropriate action at a Consumer. In the OSM event notification service, an OSM managed server is the supplier of the event, and the NetworkConsole, a wrapper class for the firm's network management facility, is the consumer of the action. The EventChannel component adds value by filtering out redundant events received from suppliers so that the intended actions at the consumer are better managed and controlled. Specifically, alarms to the network management system are sent only at predefined intervals in order to reduce the bombardment effect upon the console operator. The NetworkConsole implementation internally routes the message to the facility's network management system. In this manner, a constant feedback loop is maintained that gives operators early warning and assists them in rapid problem determination and resolution. (Figure 3).
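The filtering behavior described above can be sketched as a channel that forwards at most one alarm per event key per interval. ThrottlingChannel and its push method are hypothetical names, not the OSM EventChannel or the CORBA Event Service API, and the current time is passed in explicitly so that the example is deterministic; a real channel would read a clock.

```python
from typing import Callable, Dict, Hashable, List


class ThrottlingChannel:
    """Forwards an event to the consumer unless the same event key
    was already forwarded within the last `interval` seconds."""

    def __init__(self, consumer: Callable[[str], None], interval: float) -> None:
        self._consumer = consumer
        self._interval = interval
        self._last_sent: Dict[Hashable, float] = {}

    def push(self, key: Hashable, message: str, now: float) -> bool:
        """Return True if the message was forwarded, False if suppressed."""
        last = self._last_sent.get(key)
        if last is not None and now - last < self._interval:
            return False                    # redundant event: filter it out
        self._last_sent[key] = now
        self._consumer(message)             # e.g. hand off to the network console
        return True


console: List[str] = []                     # stand-in for the consumer
channel = ThrottlingChannel(console.append, interval=60.0)

results = [
    channel.push("Customer/timeout", "Customer server timeout", now=0.0),
    channel.push("Customer/timeout", "Customer server timeout", now=10.0),
    channel.push("Account/timeout", "Account server timeout", now=15.0),
    channel.push("Customer/timeout", "Customer server timeout", now=61.0),
]
print(results)        # [True, False, True, True]
print(len(console))   # 3
```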


 

Node Management Service

 

Intent: Provide a facility that can influence the ORB to redirect client invocations to specific nodes.

 

Motivation: In a distributed environment it is imperative to control which nodes should be accessed to service client requests. This is needed to reduce the extent of message queuing and resource consumption at particular machines that, if left uncontrolled, can result in unacceptable performance problems. It is also needed to redirect client requests to alternative machines when a particular machine is temporarily out of service, so that business resumption is automatic and always guaranteed.

 

Solution: The OSM node management service is a facility that is used to determine and control the nodes that the ORB actually searches when facilitating a client's request. By using this service, operations personnel can dynamically switch client access from one node to another. Node management is thus able to influence load balancing across nodes. It also allows the selective pointing of clients to particular nodes, while another node is out of service for software or other maintenance activities.

 

Architecture: Node management functions by returning to each client a list of node names that represent the set of in-service nodes which have been configured to support a particular client's software release. Before invoking business operations, the client requests a node list from the node management service for the client's intended software release. The client passes the returned node list to the local ORB agent, which uses the list to perform an ordered search of nodes to locate instances of running method servers. When responding to client requests, the node management service rotates the order of nodes returned in the list. Over time this results in an even distribution of client invocations across multiple nodes, creating a form of load balancing. For example, if three nodes are returned in each list, then each node on the list eventually receives one third of the client requests. (Figure 4).
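The rotation of the node list can be illustrated with a short sketch. NodeManager, node_list, the release label and the node names are hypothetical, and the sketch assumes that the ORB agent tries the first node on the list first.

```python
from collections import Counter, deque
from typing import Deque, Dict, List


class NodeManager:
    """Hands out the in-service nodes for a release, rotating the order
    by one position on every request."""

    def __init__(self, nodes_by_release: Dict[str, List[str]]) -> None:
        self._nodes: Dict[str, Deque[str]] = {
            release: deque(nodes) for release, nodes in nodes_by_release.items()
        }

    def node_list(self, release: str) -> List[str]:
        nodes = self._nodes[release]
        current = list(nodes)   # the order this client will search
        nodes.rotate(-1)        # the next client starts one node later
        return current


manager = NodeManager({"2.1": ["nodeA", "nodeB", "nodeC"]})

# Nine clients ask for a node list; count which node each would try first.
first_choices = Counter(manager.node_list("2.1")[0] for _ in range(9))
print(sorted(first_choices.items()))
# [('nodeA', 3), ('nodeB', 3), ('nodeC', 3)]
```

Taking a node out of service would then be a matter of removing it from the list held for each release, so that subsequent clients never receive it.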


 

It should be recognized that node management is not truly CORBA-compliant, since it requires clients to participate in location-specific processing. The ORB should actually perform this function in a manner that is completely transparent to the client. The absence of this capability in the ORB forced the creation of this service.

 

Conclusion

The OSM framework was funded and constructed by Wells Fargo Bank out of necessity, because of a lack of vendor products engineered to perform OSM services. What is sorely needed is the adoption of industry-wide standards for ORB systems management services. In support of that need, the Object Management Group (OMG), in cooperation with X/Open, is making a concerted effort to define these standards as part of the CORBA Common Facilities domain. However, these efforts have not yet resulted in commercially available products. The introduction of such vendor products would reduce the development pressures on individual CORBA users, and would thus make the adoption of distributed object technology that much more attractive.

 

Author

David Newman, architect of OSM, is President of Technium, Inc., a Walnut Creek, California, consulting firm specializing in distributed object technology and class-based business engineering. dnewman@technium-inc.com.