
Process Mining and Network Protocols

Probing the application of process mining techniques and algorithms to network protocols

Diploma Thesis, 2015, 111 pages

Computer Science - IT-Security

Excerpt

Contents

1. Introduction
1.1. Process Mining
1.2. Business processes and network protocols
1.3. Vision
1.4. Idea, leading questions and strategy
1.5. Outcome
1.6. Structure of thesis

2. Process Mining and related topics
2.1. The BPM life-cycle
2.2. Process modeling notations
2.3. Positioning process mining
2.4. Process models, analysis and limitations
2.4.1. Model-based process analysis
2.4.2. Limitations
2.5. Perspectives of process mining
2.6. Types of process mining
2.6.1. Play-in
2.6.2. Play-out
2.6.3. Replay
2.7. Discussion
2.7.1. Discovery
2.7.2. Conformance
2.7.3. Enhancement
2.8. Findings

3. Properties and quality
3.1. Event data
3.1.1. Quality criteria and checks
3.1.2. Extensible event stream
3.2. Notation frameworks
3.3. Evaluation of algorithms
3.3.1. Problem statement
3.3.2. What “Disco” does
3.3.3. Challenges for algorithms and notation systems
3.3.4. Categorization of process mining algorithms
3.3.5. Algorithms and plug-ins for control-flow discovery
3.3.6. Fuzzy Miner
3.4. Process models
3.5. Findings - The weapons of choice

4. Prerequisites and preprocessing
4.1. Data extraction
4.2. Data transformation
4.3. Load data
4.4. Automating the ETL procedure for TCP
4.5. Findings

5. Proof of Concept
5.1. Mining TCP with Disco
5.1.1. Extracting relevant information
5.1.2. Results
5.2. Discussion
5.2.1. Recorded activities
5.2.2. Sequences
5.2.3. Limitations
5.3. Mining TCP with RapidMiner
5.3.1. Adjustments in the results perspective
5.4. Findings

6. Reasonable applications, adaptions and enhancements
6.1. Mining HTTP
6.1.1. Results
6.1.2. Discussion
6.2. Moving towards bigger captures
6.2.1. SplitCap
6.2.2. Adaptions to the ETL script
6.3. Protocol reverse engineering
6.3.1. Gathering data
6.3.2. Results
6.3.3. Discussion
6.4. Findings

7. Conclusion

A. Data
A.1. Example PNML file
A.2. Self-captured
A.2.1. tcpCapture.pcap
A.2.2. httpCapture.pcap
A.3. External
A.4. RapidMiner structure

B. Tools and software
B.1. Disco
B.2. Perl
B.3. ProM
B.4. R
B.5. RapidMiner and RapidProM
B.6. RStudio
B.7. Ruby
B.8. SplitCap
B.9. tshark
B.10. Wireshark
B.11. WoPeD

C. Sourcecode
C.1. Script tcp_pcap2xes.rb
C.2. Script http_pcap2xes.rb
C.3. Script tcp_splitPcaps2xes.rb

D. Glossary

List of Listings

List of Figures

List of Tables

Bibliography

1. Introduction

Process enhancement and conformance checking are widely discussed topics in many organizations and are in demand in a variety of application domains. Nowadays most processes are backed by or based on information systems. Solutions for Business Process Management (BPM) and Business Intelligence (BI) provide detailed information about the processes of an organization, but are mostly limited to the “to-be” view, without the possibility to monitor or diagnose process execution or real behavior. Data-mining techniques, on the other hand, are too data-oriented to provide insights into the underlying end-to-end processes.

Every information system produces event logs, whether it is a webserver logging every request and the corresponding answer or an Enterprise Resource Planning (ERP) software that dumps every transaction into continuously growing log files. In most cases these event logs are used passively, if at all. They are only paid attention to when an incident has to be investigated or an auditor asks for them. From the process miner’s point of view this is a waste, as event data can be used to discover the “as-is” process automatically and even compare it to the “to-be” process. This is where process mining steps in, trying to unveil fact-based insight into processes by examining real-life behavior.

1.1. Process Mining

Van der Aalst et al. define process mining in their manifesto as follows:

“Process mining is a relatively young research discipline that sits between computational intelligence and data mining on the one hand, and process modeling and analysis on the other hand. The idea of process mining is to discover, monitor and improve real processes (i.e., not assumed processes) by extracting knowledge from event logs readily available in today’s (information) systems.”[79, p. 1]

Whether a BPM system is already in place or not, process mining is the technology to discover or enhance processes, or to check them for conformance based on event data. There are several tools and algorithms that support extracting and visualizing processes from event logs. Process mining can be used in a large variety of application domains. The techniques are based on event data written by information systems.

The models describing processes can be discovered by extracting the audit trails of a WorkFlow Management (WFM) system or even the transaction logs of an ERP system. Beyond discovering organizational relations or following a product from manufacture to final disposal, process mining can be used for monitoring a process and highlighting deviations from a predefined model or from business rules, such as those specified in the Sarbanes-Oxley Act (SOX).[24]

1.2. Business processes and network protocols

In order to understand (business) processes and network protocols, one needs to think about how they are defined and what characteristics both have. There are several definitions for processes and business processes. Hank Johansson defines a process as “a set of linked activities that take an input and transform it to create an output. Ideally, the transformation that occurs in the process should add value to the input and create an output that is more useful and effective to the recipient either upstream or downstream.”[49].

Davenport defines a business process as a procedure to serve a customer or a market with a specified output. To reach this goal, a set of activities is structured and measured. The focus is on what the input is, how work is done and what output has to be produced, giving the process its structure.[10, p. 5] In conclusion, a process produces a defined set of results by following a series of logically related activities or tasks.[46]

Protocols can be seen as formal rules of behavior. It is unimportant if the object under observation is an international diplomatic meeting or a network communication. Protocols consist of sets of rules that minimize misunderstandings and tell everyone involved how to act or react in a certain situation.[43] As the above definitions show, there are similarities between business processes and network protocols. So why not try applying the process mining concepts to network protocols?

1.3. Vision

Discovering rarely used protocols, checking well-known and defined protocols for conformance, or providing different perspectives on protocols, e.g. on control flow, organization or time, for the purpose of enhancement seem to be viable and valuable accomplishments. Achieving these goals by observing network traffic would open the gates to monitoring and auditing at neuralgic points in networks, without the need for agents or additional modules in the information systems to be supervised. The reverse engineering or conformance checking of protocols - either in a forensic approach or “live” - would lead to new opportunities for vendors of network security devices and services, as well as for auditors or consultants.

1.4. Idea, leading questions and strategy

The idea behind this thesis is to investigate which process mining concepts, types and perspectives are applicable to network protocols. As tools for process mining, the open-source software ProM [78, pp. 265-269] and the commercial tool Disco [50] are two exemplary representatives. ProM is also available as an extension for the data mining tool RapidMiner [45] named RapidProM [40]. These tools make it possible to elaborate on and explore event logs and processes in many ways and from different points of view. Process mining has already arrived in big institutions from several domains, in both the private and public sector. Over the past years the list of talks at the Process Mining Camp [51] [52] [53] [54] shows that banks, financial auditors, business analysts, statistical researchers and advisors already put process mining into practice for many different purposes. The benefits for information security and management may include

- reverse engineering or reengineering network protocols,
- checking conformance of communication or
- enhancing the performance or compliance of communication, to name but a few.

To accomplish these goals, the information has to be extracted from plain network communication and prepared for the mining process. Additionally, the visualization is an issue to elaborate on.

Focusing on the control flow of protocols, this leads to the following set of questions:

1. Which perspectives and types of process mining are significant to network protocols? - This question will be answered via literature research and by investigating information security topics and questions.
2. Which process mining algorithms and notation systems are viable? - This question will be answered through literature research. Finding “the weapons of choice” for process mining network protocols is the main intention here.
3. What are the requirements and prerequisites to process captured network traffic with process mining tools? - This question will be tackled by a systematic literature research followed by an empirical proof of concept. The goal is to find a procedure to bring captured or live network traffic into a form that can be processed by process mining tools like ProM (see [25]).
4. What are reasonable applications of process mining in the field of network protocols? - In this section the methods discussed earlier in this thesis will be applied to concrete protocols.

1.5. Outcome

The findings in this thesis show that many questions around network protocols and their discovery or conformance can be answered. The combinations of perspectives - control flow, organization, cases and timing or frequencies - and types - play-in, play-out and replay - deliver a wide variety of applications. The properties and qualities of the logs, the mining algorithm, the model and the notation systems are crucial, as each of them on its own can prevent a satisfactory outcome of the mining procedure. The freedom of choice is additionally narrowed by the few algorithms implemented in process mining tools. For control-flow mining the Fuzzy Miner seems to be a good fit, while choosing the mining tool is a matter of one’s goals and technical skills. Disco is the tool of choice if usability, speed and beautiful visualization are expected. ProM is the more “scientific and technical” tool, offering flexibility and export options for further processing.

Both tools expect eXtensible Event Stream (XES) as input format and the quality of logs is influenced primarily by noise and incompleteness.
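For orientation, a minimal XES log could look like the following sketch, based on the XES standard’s concept and time extensions; the trace and event names are purely illustrative:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- Minimal XES log: one trace (case) containing one event. -->
<log xes.version="1.0" xmlns="http://www.xes-standard.org/">
  <extension name="Concept" prefix="concept" uri="http://www.xes-standard.org/concept.xesext"/>
  <extension name="Time" prefix="time" uri="http://www.xes-standard.org/time.xesext"/>
  <trace>
    <string key="concept:name" value="case 1"/>
    <event>
      <string key="concept:name" value="register request"/>
      <date key="time:timestamp" value="2015-01-05T09:00:00.000+00:00"/>
    </event>
  </trace>
</log>
```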

The Extract, Transform, Load (ETL) procedure has two major hurdles to overcome:

- Bridging the operational, syntactic and semantic gap between the data source - in this case network captures - and the process mining tools.
- Addressing the above mentioned quality criteria for logs.

Developing an ETL procedure is a complex and time-consuming process. Bridging the above-mentioned gap requires a deep understanding of both the network and the process mining domain. For a proof of concept, the ETL procedure was automated by scripts for control-flow mining of the Transmission Control Protocol (TCP).

For mining the Hypertext Transfer Protocol (HTTP) and dealing with big network captures, the ETL scripts were adapted and enhanced.

To predict the minimum size of a training set, or to estimate when enough logs have been gathered to mine a proper process model, a metric based on the average information gain over a growing number of cases is introduced and statistical analyses are carried out.

1.6. Structure of thesis

The structure of this thesis reflects the leading questions listed in section 1.4. Following this introduction, chapter 2 gives an overview of process mining and its related topics, to find out how the techniques, perspectives and methods of process mining work, whether they can be applied beneficially in the domain of network protocols, and whether specific questions can be answered.

Chapter 3 is dedicated to the properties and quality of event data, algorithms, the process model and the corresponding notation systems. The question of the best combination of the above-mentioned dimensions is answered by literature research.

The prerequisites and the necessary preprocessing to start process mining network protocols are covered in chapter 4. The operational, syntactic and semantic gap between the two domains is bridged by the ETL procedure and automated with a script.

Chapter 5 shows the so far derived expertise and experiences in action. The proof of concept shows the application of process mining to elaborate on the control flow of the TCP.

Chapter 6 highlights further interesting applications. The ETL process is adapted to HTTP, and to be able to deal with bigger amounts of event data the ETL process for TCP is enhanced. As a culmination of this thesis, the problem of finding an adequate amount of event data is addressed. Final conclusions and an outlook on future work are the subjects of the last two chapters.

2. Process Mining and related topics

This chapter provides all basic information and an explanation of the technical environment to address the connections between network traffic and its protocols on the one hand, and the tools and algorithms of process mining on the other.

The first gap to close is the one between the network traffic and the tool that is used for process mining.

2.1. The BPM life-cycle

To understand process mining we need a basic understanding of the BPM life-cycle shown in figure 2.1.

illustration not visible in this excerpt

Figure 2.1.: BPM life-cycle [78, p. 8]

The different phases of managing a business process describe a circle. The phases are:

- Design / Redesign: Here the process is designed as a model.
- Configuration / Implementation: Depending on the existence and maturity level of the WFM or BPM system the model is transformed into a running system.
- Enactment / Monitoring: In this phase the process can be fired at any time and is monitored by the management. The gathered data is the foundation for future enhancements and adaptions.
- Adjustment: Minor changes can be made in the adjustment phase. A redesign of the process is not possible here.
- Requirements / Diagnosis: The process is evaluated here and requirements for a redesign are derived from the monitoring data or external motivations like change of policies or new laws.

While the enactment/monitoring and the diagnosis/requirements phases are more data-centric, the primary focus during the (re)design and configuration/implementation phases is on the process models. However, the diagnosis/requirements phase is mostly not supported in a managed way, so the life-cycle is only restarted when there are severe problems or external changes take place.

BPM tools have limitations when it comes to supporting the diagnosis and (re)design phases. The root cause is the missing connection between design and reality and the resulting inability to compare them automatically (see conclusions in [2]). Process mining offers a way to discover, monitor and improve real processes by analyzing the data recorded by information systems. This more factual information derived from event logs can also trigger the BPM life-cycle.[78, pp. 7-8]

2.2. Process modeling notations

Process modeling deals with the activity of representing processes. The visualization can take place in different notations. Each notation has its strengths and weaknesses, mostly resulting from a trade-off between ease of use, universal usability and the field of application. As BPM and Process-Aware Information Systems (PAIS) (see [17, pp. 5-8] for a definition) depend on process models, the modeling of business and other processes is mission-critical. The processes are described in terms of activities that have to be ordered correctly, both logically and chronologically. The notations have to provide the means to accomplish this requirement. A more detailed view of the notations and their properties and usability is given in section 3.2.

2.3. Positioning process mining

Process mining creates the links between the processes and their data and the process model. Figure 2.2 shows these links.

Information systems produce a vast amount of event data that is written - most often unstructured - into one or more tables or plain text files. To do proper process mining the extraction and aggregation of the event data is mission critical. Each event logged should consist of the following information:

- Index/Timestamp: Makes the sequential order of events unambiguous.

illustration not visible in this excerpt

Figure 2.2.: Positioning of the three main types of process mining [78, p. 9]

- Activity: A well defined step in the process.
- Case: A process id that helps to distinguish the process instances from each other.

With these pieces of information the control flow of a process can be extracted from the event log. Depending on the purpose of the process mining, additional information can be stored for other or more comprehensive analysis.[78, pp. 8-9]
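To illustrate, extracting the control flow from such minimal event records can be sketched in a few lines of Ruby (the language used for the ETL scripts in appendix C); the case IDs, activities and timestamps below are purely illustrative:

```ruby
# Minimal events for control-flow mining: case ID, activity, timestamp.
events = [
  { case_id: 1, activity: "register request", timestamp: Time.utc(2015, 1, 5, 9, 0) },
  { case_id: 1, activity: "examine",          timestamp: Time.utc(2015, 1, 5, 9, 30) },
  { case_id: 2, activity: "register request", timestamp: Time.utc(2015, 1, 5, 9, 10) },
  { case_id: 1, activity: "decide",           timestamp: Time.utc(2015, 1, 5, 10, 0) }
]

# Group events into cases and restore the chronological order per case:
# this yields one trace (sequence of activities) per process instance.
traces = events.group_by { |e| e[:case_id] }
               .transform_values { |es| es.sort_by { |e| e[:timestamp] }
                                          .map { |e| e[:activity] } }

puts traces[1].inspect  # => ["register request", "examine", "decide"]
```

Note how the timestamp keeps overlapping executions of the same process apart: the events of case 2 interleave with those of case 1 in the raw log, yet each trace comes out in order.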

2.4. Process models, analysis and limitations

Process models allow many approaches[78, p. 6] to look at processes:

For better insight the modeler is prompted to view the process from various angles. A model also forms a basis for a proper discussion with stakeholders, while documenting a process makes it possible to train other people in a certain procedure or policy, or to achieve a certification. The process model can be used as a baseline for analyzing systems or procedures for failures or non-conformities. Techniques like simulation can be used to examine the performance of a process. Animation, specification and configuration can also be done using process models.

2.4.1. Model-based process analysis

As mentioned in section 2.4, verification and performance analysis are two main issues in process analysis. As the focus of this thesis is on control flow, verification is the more important of the two, as it is concerned with the correctness of a process[78, p. 52].

Verification Two possible tasks for verification are

- checking the soundness of a process and
- the comparison of two models.

“Soundness” means, in effect, that the end state must be reachable from each and every state of a process. Anomalies such as deadlocks[1] and livelocks[2] have to be eliminated, as they prevent reaching the end state.

Exemplary tools for verification are Woflan[1] and the workflow system Yet Another Workflow Language (YAWL)[31].

Example For better legibility an example, with a given Petri net (see appendix A.1 for details) in the form of a Petri Net Markup Language (PNML) file and the tool Workflow Petri Net Designer (WoPeD) (see appendix B.11), will explain how soundness can be tested automatically. Figure 2.3 shows the Petri net under examination. The process describes the handling of requests for compensation and is mined from cases 1 and 4 of the event data in [78, p. 13], leading to the model in [78, p. 15]. WoPeD is able to perform a semantical analysis (found under Analyze → Semantical analysis) on a given Petri net.

Result The result of this analysis is shown in figure 2.4. Some statistics about places, transitions and arcs are listed in the bottom half. More important here is the analysis of soundness (explained in section 2.4.1). WoPeD confirms the soundness of the Petri net, illustrated by the green checkmark.

2.4.2. Limitations

The verification analysis only makes sense if it is based on proper process models. There are several problems caused by “a lack of alignment between hand-made models and reality”[78, p. 57]. The model is useless if it is based on wrong conclusions or represents a too idealized version of reality. What quality criteria models have to fulfill and how they are measured and quantified is described in section 3.4. According to [78, p. 57], “Process mining aims to address these problems by establishing a direct connection between the models and actual low-level event data about the process.”

illustration not visible in this excerpt

Figure 2.3.: Exemplary Petri net

illustration not visible in this excerpt

Figure 2.4.: Semantic analysis results

2.5. Perspectives of process mining

When talking about processes and their variations of execution, the control-flow perspective is taken up: the ordering and logical sequence of activities and the possible or intended variants of execution come to the fore. A comprehensive example of this perspective is given in section 2.6. Additionally there is a non-exhaustive set of other perspectives[78, p. 11] that have to be considered.

Organization The focus is on the resources in a process and how they interact. Actors are structured in terms of roles or organizational matters. As activities are put into relation to resources, interesting fields like work distribution, work patterns or roles can be investigated.[78, pp. 221-230] Let us assume a company policy stating that for each and every activity in the business processes there have to be at least two persons who are able to successfully perform the activity. The organizational mining perspective could answer this question, showing every activity that has been performed by only one person and highlighting the need for action.

Cases The focus is on the characterization of process executions based on certain values of properties. This perspective is all about the mining of decision trees.[78, pp. 234-237]

Assuming a process with an XOR-split, this perspective tries to tell why a certain path in the process execution is taken. For this reason the event log has to contain the decisive factors as properties.

Time This perspective highlights all timing- and frequency-related topics of processes.[78, pp. 230-233]

This perspective can answer questions about service levels. An example could be the proof or assurance that a certain response time or time to repair is provided to a company’s customers. Another purpose could be the detection of bottlenecks: wait times before execution can highlight congestion in processes.

2.6. Types of process mining

There are three types of process mining, namely

- Discovery,
- Conformance and
- Enhancement.

The types of process mining describe the relation between the process model and event data and how they can be translated into each other or checked for conformance. Event data has to contain certain properties to enable process mining. Event data will be discussed in section 3.1; however, an exemplary event log could look like table 2.1, showing some remarkable characteristics. Each event is described by a case ID, an event ID and an activity. This is the minimal requirement to mine a process model. The case ID is the identifier for an instance of the process. The event ID has to be a unique identifier within an instance of a process; either the events are consecutively numbered or the event ID is a timestamp. Both options help to avoid confusion about the chronological order of events when two executions of the same process overlap. The activity represents a step during the execution of an instance.

illustration not visible in this excerpt

Table 2.1.: Exemplary event log (extracted from [78, p. 13])

The following sections provide an overview of the methods and the types of process mining they are used for.

2.6.1. Play-in

When doing play-in the goal is to derive the process model from real behavior described by raw event data. As figure 2.5 describes, there is no need to do process modeling, because the model is inferred from the event data by a process mining algorithm. The features and properties of these process mining algorithms will be considered in section 3.3.

illustration not visible in this excerpt

Figure 2.5.: Play-in [78, p. 19]

When doing play-in, the process model is inferred from behavior. The mission-critical part in this procedure is finding a process model that represents - and represents only - the recorded traces in the event log. This type of process mining is used for discovery purposes.

Inferring a process model

Based on the event log in table 2.1 case 1 results in the process model shown in figure 2.6.

illustration not visible in this excerpt

Figure 2.6.: Process model of Case 1

When investigating case 2, another variation of the process appears: the Examine and Check ticket activities change their order of occurrence. Their order of execution is irrelevant, so they can be parallelized. In addition, the activity Examine occasionally takes the place of the activity “Examine thoroughly”, so exactly one of these activities is executed. After the Decision the request is either rejected or compensation is paid. This leads to the process model in figure 2.7.

Assuming that the event log contains more cases, the model may need to be extended again until all recorded cases are represented by the model.

illustration not visible in this excerpt

Figure 2.7.: Process model for case 1 and 2
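The inference step just described can be sketched algorithmically. Many discovery algorithms (e.g. the α-algorithm) start from the directly-follows relation of the traces and treat two activities that directly follow each other in both orders as candidates for parallel execution. A minimal Ruby sketch, with traces loosely modeled on cases 1 and 2 (activity names abbreviated, not taken verbatim from table 2.1):

```ruby
# Directly-follows relation: a > b holds if b directly follows a in a trace.
traces = [
  %w[register examine check decide reject],  # loosely modeled on case 1
  %w[register check examine decide pay]      # loosely modeled on case 2
]

follows = Hash.new { |h, k| h[k] = [] }
traces.each do |trace|
  trace.each_cons(2) { |a, b| follows[a] << b unless follows[a].include?(b) }
end

# If a > b and b > a both hold, the two activities appear in either order
# and are candidates for parallelization -- here "examine" and "check".
parallel = follows.keys.combination(2).select do |a, b|
  follows[a].include?(b) && follows[b].include?(a)
end

puts parallel.inspect  # => [["examine", "check"]]
```

Real discovery algorithms build a full footprint matrix (sequence, choice, parallelism) from this relation; the sketch only shows the parallelism case that distinguishes figure 2.7 from figure 2.6.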

Discovery

The procedure described above is called discovery and can be performed without any a-priori information. Assuming that the event log contains sufficient and comprehensive example executions of the process to be discovered, this leads to, for example, a Petri net describing the process steps. Discovery is used when there is no precedent information about the process.

2.6.2. Play-out

The basic idea with play-out is to derive behavior from an already existing model.

illustration not visible in this excerpt

Figure 2.8.: Play-out [78, p. 19]

Executions of the modeled process are simulated. The goal is to play out the complete process model and find every possible path of execution and scenario that is foreseen by the model. The number of possible scenarios ranges from one - when there are no optional paths - to infinity, when there is a loop within the process model.
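To make this concrete, play-out of a small model can be sketched by enumerating all paths through it. The toy model below (purely illustrative, written as a simple state machine rather than a Petri net) has a single XOR choice and therefore yields exactly two scenarios; adding a loop edge would make the enumeration unbounded, which is why the number of scenarios can become infinite:

```ruby
# A toy process model as state -> list of [activity, next state] pairs.
TOY_MODEL = {
  "start"   => [["register", "decided"]],
  "decided" => [["reject", "end"], ["pay", "end"]],
  "end"     => []   # final state: nothing is enabled
}.freeze

# Enumerate every trace the model foresees (assumes the model is acyclic).
def play_out(model, state = "start", prefix = [])
  return [prefix] if model[state].empty?
  model[state].flat_map { |activity, nxt| play_out(model, nxt, prefix + [activity]) }
end

puts play_out(TOY_MODEL).inspect  # => [["register", "reject"], ["register", "pay"]]
```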

Simulation / Verification

When building an information system or simulating one based on process models, play-out is the appropriate type. Verification, e.g. for the purpose of model checking, is also an application domain for this type of process mining.[78, pp. 18-19]

2.6.3. Replay

This type of process mining needs both the event log and a process model. As figure 2.9 shows, the reality, represented by an event log, is replayed on top of the process model.

illustration not visible in this excerpt

Figure 2.9.: Replay [78, p. 19]

Replay can be used for the following purposes[78, p. 19]:

- Conformance checking: Deviations between the log and the underlying process model can be detected and investigated by replaying the log and inspecting the traces that show the deviation.
- Extending the model with frequencies and temporal information: To figure out which parts of the model are frequently used and tend to be bottlenecks, one can consider the timestamps and the number of executions. As the main focus of this thesis is the control flow, this will not be further examined.
- Constructing predictive models: In some cases the execution time is relevant, e.g. when service level agreements or guaranteed response times come into play. Through replaying logs on the process model one can learn to predict the completion time from any state of the process. Again, this is not about the control flow and therefore not relevant in this thesis.
- Operational support: This can be accomplished by live replay during the execution. While executing the process on top of a model, deviations or other flaws can be detected, giving the opportunity to influence the current execution. The focus of this technique can be, among others, the control flow.

Conformance checking

To give a concrete example, the following assumption describes a situation, where the conformance check highlights a flaw.

illustration not visible in this excerpt

Table 2.2.: Additional event (extracted from [78, p. 13])

Assumption In addition to the cases in table 2.1 there is another case shown in table 2.2.

When replaying case 3 on top of the model in figure 2.7, one can see that the event with ID 114 is not possible in this model. It is therefore seen as a deviation, and a conformance check is bound to fail. If the behavior represents a valid use case, the model is incomplete.
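A naive replay-based conformance check can be sketched as follows. The model is given as a map from state to the activities enabled in it; the states, activities and traces are purely illustrative and do not reproduce the concrete events of tables 2.1 and 2.2. A trace conforms if every activity is enabled in the current state and the end state is reached:

```ruby
# Illustrative sequential model: state -> { enabled activity => next state }.
NOMINAL_MODEL = {
  "p0" => { "register" => "p1" },
  "p1" => { "decide"   => "p2" },
  "p2" => { "pay" => "end", "reject" => "end" }
}.freeze

def conforms?(model, trace)
  state = "p0"
  trace.each do |activity|
    nxt = model.fetch(state, {})[activity]
    return false if nxt.nil?  # activity not enabled here -> deviation
    state = nxt
  end
  state == "end"              # the trace must also reach the end state
end

puts conforms?(NOMINAL_MODEL, %w[register decide pay])  # => true
puts conforms?(NOMINAL_MODEL, %w[register pay])         # => false ("pay" not enabled)
```

Full token replay on a Petri net additionally counts missing and remaining tokens to quantify how badly a trace deviates; the sketch only gives the yes/no answer of the example above.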

2.7. Discussion

The leading question of this discussion is the significance of the types of process mining. The following paragraphs will elaborate on ideas and specific applications of process mining methods and types on network protocols. Additionally, a preview of the estimated advantages and benefits is given.

2.7.1. Discovery

Assumption 1 A news company runs a webserver providing news of several domains, e.g. politics, sports, culture to name but a few. The marketing wants to know:

- How do visitors of the website navigate?
- Which domains are the most visited or visited first?
- How much time do visitors spend on the website?

All these questions can be answered through discovery techniques and mining of the control flow. The TCP and HTTP packet headers contain every bit of information needed to answer the above questions. This can serve as a basis for further investigations or for profiling visitors of the website.

Assumption 2 A less widely used TCP-based network protocol should be investigated and reverse engineered. The packet headers contain the control characters and words.

The control flow of the protocol can be mined and visualized with process mining techniques. A potential attacker could gain deeper understanding of the protocol as a basis for attacks.

2.7.2. Conformance

Assumption 1 When thinking of an Intrusion Prevention System (IPS), ensuring the correct control flow of the network protocol is a main task. Deviations from the standardized behavior could point towards a possible intruder break-in, trying to exploit a weakness of the protocol. Replaying the observed behavior on top of a nominal model can highlight these deviations.

Assumption 2 If certain activities in the expected control flow are missing or malformed, this could point to a weak or misinterpreted implementation of a network protocol.

Again, replaying the observed behavior on top of a nominal model during a test can highlight these deviations.

2.7.3. Enhancement

The goal of enhancement is to extend or repair an existing process model based on observed real-life behavior. Enhancement comes into play, when the model does not cover every aspect of reality. As network protocols are well defined or even standardized, enhancement plays a minor role in this context. All behavior not agreeing with standards has to be classified as a deviation.

2.8. Findings

As described in section 2.7 there is a plethora of application opportunities for control-flow focused process mining applied to network protocols, promising better insight into this field. How logs need to be prepared is described in chapter 4, while chapter 5 shows a proof of concept and how the ideas can effectively be put into practice.

Besides the control-flow perspective, the other perspectives listed in section 2.5 can also be taken into account and combined arbitrarily with the process mining types. This opens a wide variety of opportunities to look at network protocols. Specific questions that require observing other characteristics of the behavior of a network protocol can be answered with these approaches.

The organizational perspective can provide an understanding of which computer systems interact and reveal key players in a network. Taking the case perspective makes it possible to understand in which variations the protocol is commonly used and to examine them. Protocols often also contain restrictions on timing and frequencies, which can be elaborated by taking the time perspective into consideration.

3. Properties and quality

There are three key components in the field of process mining: the event data, the process model - depicted using a notation framework - and the mining algorithms. All of these components have to have certain properties and fulfill certain minimum quality requirements.

illustration not visible in this excerpt

Figure 3.1.: Event, model and algorithm

Each component depends on the properties of the others, as figure 3.1 shows. The following sections explain the properties and quality criteria of the components and elaborate on their inter-dependencies. Another important issue to keep in mind is the representational bias: the notation system of the process model (e.g. workflow nets, Petri nets et al.) has to be able to represent every aspect the mining algorithm is capable of discovering, and vice versa. The following sections also deal with this issue.

3.1. Event data

Event data is logged in many different forms and formats. Almost all information systems include a logging mechanism, but the implementations vary widely. The most common forms are plain text files, databases and data warehouses. Figure 3.2 shows a generic structure that most event logs of process-aware information systems (PAIS) follow.

illustration not visible in this excerpt

Figure 3.2.: Structure of event logs[78, p. 100]

The process is the hierarchically highest instance in this structure. Examples of a process could be a business sales process or a service provided by a server (e.g. a running Apache webserver[27]). Cases are instances of a process, identified by a primary designator (e.g. by a PID on a Linux server[5]). A case consists of one or more events, where each event consists of attributes that describe what happened and when.

This leads to the smallest set of information needed to do process mining, assuming that the event log contains events of only one process:

A case id is necessary to distinguish several instances of a process; events belong to a certain case. Furthermore, the chronological order of events is crucial to process mining. In most cases this is accomplished by a timestamp or a continuous counter.

As activities often take some time to execute, they run through the so-called transactional life-cycle. The stages of an activity are represented by an additional attribute that can take on values like start, suspend, resume or complete.[78, pp. 139-140]
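To illustrate the minimum requirements above, the following sketch builds a small, entirely hypothetical event log with a case id, timestamp, activity and life-cycle attribute, and groups the events into chronologically ordered cases (all names and values are invented for illustration):

```python
from collections import defaultdict

# Hypothetical event log: each event carries the minimal attributes
# needed for process mining plus a transactional life-cycle stage.
events = [
    {"case": "c1", "time": 1, "activity": "SYN",     "lifecycle": "start"},
    {"case": "c1", "time": 2, "activity": "SYN",     "lifecycle": "complete"},
    {"case": "c2", "time": 3, "activity": "SYN",     "lifecycle": "start"},
    {"case": "c1", "time": 4, "activity": "SYN-ACK", "lifecycle": "complete"},
]

# Group events into cases, ordered chronologically by timestamp.
cases = defaultdict(list)
for event in sorted(events, key=lambda e: e["time"]):
    cases[event["case"]].append(event["activity"])

print(dict(cases))  # {'c1': ['SYN', 'SYN', 'SYN-ACK'], 'c2': ['SYN']}
```

Each resulting sequence of activities per case is exactly the kind of trace a discovery algorithm consumes.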

3.1.1. Quality criteria and checks

The main quality criteria of event data are noise and incompleteness. The following subsections describe these criteria and point to possible checks to measure and quantify them.

Noise

If the event log contains rare and infrequent behavior that is not representative of the typical behavior of the process, this is called noise. Noise refers to exceptional events rather than incorrectly logged events. The discovery algorithm cannot distinguish incorrect logging from exceptional events. It is therefore the responsibility of the human analyst to judge, to do proper pre- and postprocessing of the extracted event log, and to avoid incorrect logging at an early stage.[78, p. 148]
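One common preprocessing step against noise is to filter out infrequent trace variants before discovery. A minimal sketch of this idea, with an invented threshold and invented traces:

```python
from collections import Counter

# Traces as activity sequences, one tuple per case (hypothetical data).
traces = [
    ("SYN", "SYN-ACK", "ACK"),
    ("SYN", "SYN-ACK", "ACK"),
    ("SYN", "SYN-ACK", "ACK"),
    ("SYN", "RST"),  # rare variant: exceptional behavior or bad logging?
]

variant_counts = Counter(traces)
threshold = 0.3  # keep variants covering at least 30% of all cases

frequent = {
    variant: count
    for variant, count in variant_counts.items()
    if count / len(traces) >= threshold
}
print(frequent)  # {('SYN', 'SYN-ACK', 'ACK'): 3}
```

Whether the filtered variant really is noise remains, as stated above, a human judgment call; the threshold only makes that decision explicit and repeatable.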

Incompleteness

If the event log contains too few events to discover some of the underlying control-flow structures, this is called incompleteness.[78, p. 149] A sufficiently large log is therefore mission-critical for process mining.

Checks

To quantify how seriously the above-mentioned criteria have to be taken, the log can be inspected by cross-validation. When applying k-fold cross-validation to a log, the data is split into e.g. ten subsets and each is validated against the others, as [78, pp. 85-88] and [78, pp. 149-150] describe.
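The splitting step of k-fold cross-validation on a log can be sketched as follows: the case ids are partitioned into k folds, and for each fold a model would be discovered from the remaining k-1 folds and replayed against the held-out cases (the discovery and replay steps are only indicated in comments; case names are invented):

```python
def k_fold_splits(case_ids, k):
    """Partition case ids into k folds and yield (train, test) pairs."""
    folds = [case_ids[i::k] for i in range(k)]
    for i, test in enumerate(folds):
        train = [c for j, fold in enumerate(folds) if j != i for c in fold]
        yield train, test

case_ids = [f"case-{n}" for n in range(10)]  # hypothetical cases
for train, test in k_fold_splits(case_ids, k=5):
    # In a real check, a model discovered from the `train` cases would
    # be replayed against the `test` cases to estimate completeness.
    assert len(train) == 8 and len(test) == 2
```

If the held-out cases consistently replay well on models discovered from the other folds, incompleteness is less of a concern for this log.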

3.1.2. Extensible event stream

Among the many forms of event logs there is no generally acknowledged format. The eXtensible Event Stream (XES) format tries to handle the above challenges, with its main focus on process mining, and follows four guiding principles: simplicity, flexibility, extensibility and expressivity. This implies that only elements appearing in every event log are explicitly defined, while all others are optional attributes.[42, p. 1]

illustration not visible in this excerpt

Figure 3.3.: Meta model of XES[78, p. 109]

Figure 3.3 shows the XES meta-model. This leads to an exemplary syntax as shown in listing 3.1.

[...]
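The listing itself is omitted from this excerpt. As an independent illustration of the structure in figure 3.3, the following sketch builds a minimal XES-like document with Python's standard library; the element names `log`, `trace` and `event` and the attribute keys `concept:name` and `time:timestamp` follow the XES conventions, while all concrete values are invented:

```python
import xml.etree.ElementTree as ET

# Build a minimal XES-like log: one trace (case) with two events.
log = ET.Element("log", {"xes.version": "1.0"})
trace = ET.SubElement(log, "trace")
ET.SubElement(trace, "string", {"key": "concept:name", "value": "case-1"})

for activity, timestamp in [("SYN", "2015-01-01T10:00:00"),
                            ("SYN-ACK", "2015-01-01T10:00:01")]:
    event = ET.SubElement(trace, "event")
    ET.SubElement(event, "string", {"key": "concept:name", "value": activity})
    ET.SubElement(event, "date", {"key": "time:timestamp", "value": timestamp})

print(ET.tostring(log, encoding="unicode"))
```

The same nesting - log, trace, event, typed attributes - is what the elided listing 3.1 would show in full XES syntax.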


1 A situation in which two or more competing actions are each waiting for the other to finish, and thus neither ever does.

2 Similar to a deadlock, except that the states of the processes involved in the livelock constantly change with regard to one another, none progressing.

Details

Pages: 111
Year: 2015
ISBN (eBook): 9783668066120
ISBN (Book): 9783668066137
File size: 2 MB
Language: English
Catalog Number: v308134
Institution / College: St. Pölten University of Applied Sciences – Informatik & Security
Grade: 1
Tags: Process Mining, Event data, discovery, conformance, enhancement, process, prom, disco, algorithm, notation system
