Event Correlation
Event correlation is a technique for making sense of a large number of events and pinpointing the few events that are really important in that mass of information.
History
Event correlation has been used in telecommunications and industrial process control since the 1970s, in network management and systems management since the 1980s, in IT service management and event-based systems since the 1990s, and in business activity monitoring (BAM) since the early 2000s.
Event correlation in integrated management
The goal of integrated management is to integrate the management of networks (data, telephone and multimedia), systems (hosts and applications) and IT services in a coherent manner. The scope of this discipline notably includes network management, systems management and Service-Level Management.
Events and event correlator
Event correlation usually takes place inside one or several management platforms (also known as Network Management Stations or Network Management Systems). It is implemented by a piece of software known as the event correlator. This tool is automatically fed with events originating from managed elements, monitoring tools, the Trouble Ticket System, etc. Each event captures something special (from the event source standpoint) that happened in the domain of interest to the event correlator (e.g., the reboot of a device, a Service-Level Objective that is not met for a given customer, or the CPU of an e-business server that is used at 100% for over 15 minutes).
The event correlator plays a key role in the integration of management, for only there do network, system and service events come together. For instance, this is where the failure of a service can be ascribed to a specific failure in the underlying IT infrastructure.
Most event correlators can receive events from trouble ticket systems. However, only some of them are able to notify trouble ticket systems when a problem is solved, which partly explains why Service Desks have difficulty staying up to date. In theory, the integration of management in organizations requires the communication between the event correlator and the trouble ticket system to work both ways.
An event may convey an alarm or report an incident (which explains why event correlation used to be called alarm correlation), but not necessarily. It may also report that a situation has returned to normal, or simply carry information that the event source deems relevant (e.g., policy P has been updated on device D). The severity of an event is an indication, given by the event source to the event destination, of the priority that the event should receive during processing.
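As an illustration, an event of this kind could be modelled as follows. This is only a minimal sketch: the `Event`, `Kind` and `Severity` names, the field set and the four severity levels are assumptions made for the example, not part of any standard or product.

```python
from dataclasses import dataclass
from enum import Enum, IntEnum


class Kind(Enum):
    ALARM = "alarm"  # reports a fault or an incident
    CLEAR = "clear"  # reports that a situation has returned to normal
    INFO = "info"    # purely informational


class Severity(IntEnum):
    # Hypothetical scale; real event sources define their own levels.
    INFO = 0
    WARNING = 1
    MAJOR = 2
    CRITICAL = 3


@dataclass(frozen=True)
class Event:
    source: str         # managed element that emitted the event
    kind: Kind
    severity: Severity  # priority hint set by the event source
    message: str


events = [
    Event("device-D", Kind.INFO, Severity.INFO, "policy P has been updated"),
    Event("server-S", Kind.ALARM, Severity.CRITICAL, "CPU at 100% for over 15 minutes"),
    Event("server-S", Kind.CLEAR, Severity.INFO, "CPU load back to normal"),
]

# Process the most severe events first. sorted() is stable, so events
# with the same severity keep their arrival order.
by_priority = sorted(events, key=lambda e: e.severity, reverse=True)
```

Here the severity is used only to order processing; the event destination remains free to reinterpret it.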
Step-by-step decomposition
Event correlation can be decomposed into four steps: event filtering, event aggregation, event masking and root cause analysis. A fifth step (action triggering) is often associated with event correlation and therefore briefly mentioned here.
Event filtering
Event filtering consists in discarding events that are deemed to be irrelevant by the event correlator. For instance, a number of bottom-of-the-range devices are difficult to configure and occasionally send events of no interest to the management platform (e.g., printer P needs A4 paper in tray 1). Another example is the filtering of informational or debugging events by an event correlator that is only interested in availability and faults.
Event aggregation
Event aggregation (also known as event de-duplication) consists in merging duplicates of the same event. Such duplicates may be caused by network instability (e.g., the same event is sent twice by the event source because the first instance was not acknowledged sufficiently quickly, but both instances eventually reach the event destination). Another example is temporal aggregation, when the same event is sent over and over again by the event source until the problem is solved.
Event masking
Event masking (also known as topological masking in network management) consists in ignoring events pertaining to systems that are downstream of a failed system. For example, servers that are downstream of a crashed router will fail availability polling; the resulting events can be masked because they merely reflect the router failure.
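The three steps described so far (filtering, aggregation and masking) can be sketched together as a small pipeline. The code below is an illustration under stated assumptions, not the implementation of any particular event correlator: the `Event` fields, the severity labels, the `upstream` map (each system pointing to the system it sits behind) and all function names are hypothetical, and the topology is assumed to be a tree without cycles.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    source: str    # managed element that emitted the event
    name: str      # what happened, e.g. "unreachable"
    severity: str  # assumed labels: "debug", "info", "warning", "critical"


def filter_events(events, ignored=("debug", "info")):
    """Event filtering: discard events deemed irrelevant."""
    return [e for e in events if e.severity not in ignored]


def aggregate_events(events):
    """Event aggregation: merge duplicates, keeping a count per event."""
    counts = {}
    for e in events:
        counts[e] = counts.get(e, 0) + 1
    return counts  # dicts preserve first-seen order


def mask_events(events, upstream, failed):
    """Event masking: ignore events from systems downstream of a failed one."""
    def is_downstream(node):
        node = upstream.get(node)
        while node is not None:  # walk towards the root of the topology
            if node in failed:
                return True
            node = upstream.get(node)
        return False

    return [e for e in events if not is_downstream(e.source)]


# Two servers sit behind router R, which has crashed.
upstream = {"server-1": "router-R", "server-2": "router-R", "router-R": None}
raw = [
    Event("printer-P", "needs A4 paper in tray 1", "info"),
    Event("router-R", "down", "critical"),
    Event("router-R", "down", "critical"),  # duplicate of the previous event
    Event("server-1", "unreachable", "critical"),
    Event("server-2", "unreachable", "critical"),
]

aggregated = aggregate_events(filter_events(raw))
failed = {e.source for e in aggregated if e.name == "down"}
remaining = mask_events(list(aggregated), upstream, failed)
```

With this input, only the router event remains, and its duplicate count is kept in `aggregated`. Temporal aggregation (merging repeats within a time window) would additionally need timestamps, which this sketch leaves out.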
Root cause analysis
Root cause analysis is the last and most complex step of event correlation. It consists in analyzing dependencies between events, based for instance on a model of the environment and dependency graphs, to detect whether some events can be explained by others. For example, if database D runs on server S and this server gets durably overloaded (CPU used at 100% for a long time), the event “the SLA for database D is no longer fulfilled” can be explained by the event “Server S is durably overloaded”.
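The database example above can be sketched with a small dependency graph. This is a deliberately naive illustration: the `depends_on` map, the function names and the rule "an event is explained if any component it depends on, directly or indirectly, also has an active event" are assumptions made for the example. Real correlators use richer models of the environment.

```python
def explaining_components(component, faulty, depends_on, seen=None):
    """Return the faulty components that `component` transitively depends on."""
    seen = set() if seen is None else seen
    causes = set()
    for dep in depends_on.get(component, ()):
        if dep in seen:  # already visited; also guards against cycles
            continue
        seen.add(dep)
        if dep in faulty:
            causes.add(dep)
        causes |= explaining_components(dep, faulty, depends_on, seen)
    return causes


def root_causes(events, depends_on):
    """Keep only the events that no other active event can explain."""
    faulty = set(events)
    return {
        component: message
        for component, message in events.items()
        if not explaining_components(component, faulty, depends_on)
    }


# Database D runs on server S.
depends_on = {"database-D": ["server-S"]}
events = {
    "database-D": "the SLA for database D is no longer fulfilled",
    "server-S": "Server S is durably overloaded",
}

roots = root_causes(events, depends_on)
```

Here `roots` contains only the server event, since the database event is explained by it. The sketch assumes at most one active event per component.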
Action triggering
At this stage, the event correlator is left with at most a handful of events that need to be acted upon. Strictly speaking, event correlation ends here. However, by a loose use of the term, the event correlators found on the market (e.g., in network management) sometimes also include problem-solving capabilities. For instance, they may trigger corrective actions or further investigations automatically.
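Action triggering can be pictured as a lookup from the remaining events to handlers. The sketch below is hypothetical throughout: the event names, the `ACTIONS` table and both handlers are invented for illustration, and the handlers only return a description of what they would do instead of calling a real system.

```python
def restart_service(event):
    # Placeholder for a corrective action.
    return f"restart requested for {event['source']}"


def open_ticket(event):
    # Placeholder for further investigation by a human operator.
    return f"ticket opened: {event['name']} on {event['source']}"


# Hypothetical mapping from event name to automatic action.
ACTIONS = {"service down": restart_service}


def trigger(event, actions=ACTIONS, default=open_ticket):
    """Run the action registered for this event, or the default one."""
    handler = actions.get(event["name"], default)
    return handler(event)


results = [
    trigger({"source": "web-1", "name": "service down"}),
    trigger({"source": "server-S", "name": "durably overloaded"}),
]
```

Falling back to a ticket for unknown events keeps a human in the loop whenever no corrective action has been defined.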
Event correlation in ITIL
The scope of ITIL (the Information Technology Infrastructure Library) is larger than that of integrated management. However, event correlation in ITIL is quite similar to event correlation in integrated management.
In the ITIL version 2 framework, event correlation spans three processes: Incident Management, Problem Management and Service Level Management.
In the ITIL version 3 framework, event correlation takes place in the Event Management process. The event correlator is called a correlation engine.
See also
- Business activity monitoring
- Complex event processing
- ECA rules
- Event stream processing
- Event-driven architecture
- Event-driven programming
- Event-driven SOA
- Incident management
- Issue tracking system
- IT service management
- Network management
- Problem management
- Root cause analysis
- Supervisory control and data acquisition (SCADA)
- Systems management