The project aims to improve the effectiveness of process mining by detecting and repairing imperfections in event logs. Using a patterns-based approach, the project systematically addresses common data quality issues such as timestamp errors, misordered events, and duplication. This enhancement ensures more accurate process mining analyses and helps organizations in sectors like healthcare to monitor, analyze, and improve their internal processes efficiently.
The focus is on creating algorithms and solutions to detect, classify, and repair data imperfections in event logs, ultimately leading to higher quality data and more reliable process mining results. The project utilizes a variety of techniques from process mining, data mining, machine learning, gamification, artificial intelligence, and statistics to develop detection and repair solutions.
Specifically, the project involves the creation of new algorithms, with examples including the use of Large Language Models (LLMs) to enhance detection and repair processes. These techniques improve event log data quality efficiently, yielding more reliable process mining insights.
The Unanchored Event pattern arises when the timestamp format in the event log differs from the tool's expected format, leading to incorrect event ordering. Common issues include confusion between day-month and month-day formats, or inconsistent symbols and timezones.
The project developed a series of algorithms that analyze and repair timestamp values by detecting deviations from the expected standard. Once these algorithms identify and flag problematic timestamps, they initiate a repair process that corrects them, restoring consistent event ordering.
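As an illustration, the format-detection and repair steps could be sketched as follows. This is a minimal sketch, not the project's actual code: the two candidate formats, the canonical output format, and the disambiguation heuristic (a first field greater than 12 can only be a day) are assumptions.

```python
from datetime import datetime

# Candidate interpretations of an ambiguous "xx/yy/…" timestamp (assumed formats).
DAY_FIRST = "%d/%m/%Y %H:%M"
MONTH_FIRST = "%m/%d/%Y %H:%M"

def infer_format(timestamps):
    """Pick the first candidate format under which every timestamp parses.
    A value like 13/02/… only parses day-first, which disambiguates the log."""
    for fmt in (DAY_FIRST, MONTH_FIRST):
        try:
            for ts in timestamps:
                datetime.strptime(ts, fmt)
        except ValueError:
            continue  # at least one timestamp rejects this format
        return fmt
    raise ValueError("no candidate format fits all timestamps")

def repair(timestamps, canonical="%Y-%m-%dT%H:%M:%S"):
    """Reparse every timestamp with the inferred format and re-emit it
    in one canonical format, so the mining tool orders events correctly."""
    fmt = infer_format(timestamps)
    return [datetime.strptime(ts, fmt).strftime(canonical) for ts in timestamps]

log = ["13/02/2023 09:15", "14/02/2023 10:00"]  # 13 > 12, so day-first
print(repair(log))
```

Note that if every first field is 12 or below, the log stays ambiguous and this sketch silently defaults to the first candidate; a production version would flag that case for review.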
The Form-Based Pattern occurs when multiple events are logged from a single form as separate parallel events. This issue causes flattened event orders, unnecessary duplications, and results in overly complex process models. The project implements two distinct approaches to detect and repair this issue:
The first approach involves manually detecting and aggregating events that share the same timestamp and case ID.
Pros: Provides a manual, user-driven process for greater control over event aggregation.
Cons:
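The grouping step behind this approach can be sketched as follows. The event fields ("case_id", "timestamp", "activity") and the join-with-"+" merge rule are illustrative assumptions, not the project's actual implementation.

```python
from collections import defaultdict

def aggregate_form_events(events):
    """Group events that share a case ID and timestamp (i.e. fields logged
    from one form submission) and collapse each group into a single event."""
    groups = defaultdict(list)
    for ev in events:
        groups[(ev["case_id"], ev["timestamp"])].append(ev["activity"])
    return [
        {"case_id": cid, "timestamp": ts, "activity": "+".join(sorted(acts))}
        for (cid, ts), acts in groups.items()
    ]

events = [
    {"case_id": "c1", "timestamp": "2023-02-13T09:15", "activity": "Record BP"},
    {"case_id": "c1", "timestamp": "2023-02-13T09:15", "activity": "Record HR"},
    {"case_id": "c1", "timestamp": "2023-02-13T10:00", "activity": "Discharge"},
]
print(aggregate_form_events(events))
```

The two same-timestamp events collapse into one aggregated event, which removes the spurious parallelism from the discovered process model.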
The second approach utilizes machine learning models to automatically detect and aggregate events based on their timestamps and case IDs.
Pros:
Cons: Requires a well-trained model, which entails upfront setup and tuning costs.
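To make the learned variant concrete, the sketch below "trains" a deliberately simple pairwise model that decides whether two events came from the same form, using the time gap between them as the only feature. The project's actual models and features are not specified here; this is an assumed, minimal stand-in.

```python
def train_gap_threshold(labeled_pairs):
    """labeled_pairs: [(gap_seconds, same_form_bool), ...].
    Learn the largest gap observed among known same-form pairs."""
    same = [gap for gap, y in labeled_pairs if y]
    return max(same) if same else 0.0

def predict_same_form(threshold, gap_seconds):
    """Predict that two events belong to one form if their gap is within
    the learned threshold."""
    return gap_seconds <= threshold

# Toy "training" data: fields of one form submission land within ~2 seconds.
pairs = [(0.0, True), (1.5, True), (2.0, True), (3600.0, False)]
thr = train_gap_threshold(pairs)
print(predict_same_form(thr, 1.0), predict_same_form(thr, 600.0))
```

A real system would use a richer feature set (activity names, resources, form identifiers) and a proper classifier, which is where the setup and tuning cost noted above comes from.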
Inadvertent Time Travel occurs when timestamps are recorded incorrectly, leading to misordering of events in the process log. This pattern is typically caused by human error, such as recording the wrong date for events logged near midnight or mistyping a digit via adjacent key presses. Two distinct experiments were conducted to detect and repair these issues:
The first experiment relies on manually defined rules that are applied to all event logs during program execution.
Pros: Simple and easy to implement for rule-specific scenarios.
Cons: Limited in scope since it requires manual input of rules and is not fully automated.
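One such hand-written rule could look like the sketch below: if an event precedes its predecessor within the same case, and adding one day restores the order, treat it as a wrong-date-near-midnight error. The rule and the tuple layout are assumptions for illustration, not the project's exact rule set.

```python
from datetime import datetime, timedelta

FMT = "%Y-%m-%dT%H:%M"

def repair_midnight_errors(case_events):
    """case_events: list of (activity, timestamp_string) in logging order.
    Shift a backwards timestamp forward one day when that restores order."""
    times = [datetime.strptime(ts, FMT) for _, ts in case_events]
    for i in range(1, len(times)):
        if times[i] < times[i - 1] and times[i] + timedelta(days=1) >= times[i - 1]:
            # Likely logged just after midnight with the previous day's date.
            times[i] += timedelta(days=1)
    return [(act, t.strftime(FMT)) for (act, _), t in zip(case_events, times)]

trace = [("Admit", "2023-03-01T23:58"), ("Triage", "2023-03-01T00:05")]
print(repair_midnight_errors(trace))
```

Rules like this are cheap to run but, as noted above, each new error mode needs its own rule to be written by hand.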
The second experiment leverages a Large Language Model (LLM) to provide a more intelligent and automated solution for detecting and repairing Inadvertent Time Travel patterns.
Pros: Fully automated and able to adapt to varied cases without predefined rules, making it more generalizable.
Cons: The LLM-based approach is more costly due to the need for model fine-tuning and runtime usage costs.
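The shape of such a pipeline could be sketched as follows. Everything here is hypothetical: `call_llm` stands in for whatever chat-completion client the project uses, and the prompt wording and JSON reply schema are assumptions. A canned client is used below purely to show the data flow.

```python
import json

# Assumed prompt template; {{ }} escapes literal braces in the JSON schema.
PROMPT = (
    "The following events belong to one case, in logging order. "
    'Return JSON {{"repairs": [{{"index": i, "timestamp": "..."}}]}} '
    "for any timestamp that causes Inadvertent Time Travel.\n{trace}"
)

def repair_with_llm(events, call_llm):
    """Serialise the trace, ask the model for repairs, apply them."""
    reply = call_llm(PROMPT.format(trace=json.dumps(events)))
    repaired = list(events)
    for fix in json.loads(reply)["repairs"]:
        act, _ = repaired[fix["index"]]
        repaired[fix["index"]] = (act, fix["timestamp"])
    return repaired

# Canned stand-in for a real model, returning one suggested repair.
fake = lambda prompt: '{"repairs": [{"index": 1, "timestamp": "2023-03-02T00:05"}]}'
events = [("Admit", "2023-03-01T23:58"), ("Triage", "2023-03-01T00:05")]
print(repair_with_llm(events, fake))
```

Each repair requires a model call, which is where the runtime usage cost noted above arises.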
Both the rule-based and LLM-based approaches show similar results in terms of detection and repair accuracy. However, there are significant differences: