Unsupervised analytics for multi-source time series data: Enabling trend analytics, context-aware profiling and real-time state forecasting

Research output: PhD Thesis


Abstract

Deep learning approaches have recently produced an explosion of impressive success stories in various fields such as natural language processing, computer vision, healthcare, and robotics. The advent of transformers has further amplified the capabilities of deep learning models to understand and generate complex patterns, establishing them as a cornerstone of modern AI advancements across a broad spectrum of applications. Initially, transformers revolutionised large language models (LLMs) like GPT-4 and BERT, enabling them to process and generate human-like text with remarkable coherence and accuracy. Now, their impressive performance is also being demonstrated in other domains, extending their impact beyond language processing. Given sufficient high-quality labelled data and computational resources, deep learning models are able to achieve levels of accuracy that were previously unattainable. Consequently, much of today's AI research is devoted to improving deep learning architectures, leading to computational models that are increasingly precise, lightweight, and fast.
Unfortunately, most real-world application contexts (e.g., industrial asset operations, production processes, and mobility management) generate datasets which diverge significantly from the idealised benchmark datasets used to validate novel AI methodologies. Real-world data is typically characterised by noise, missing values, complicated parameter names, different data types, a lack of ground truth, context-dependent features, etc. This makes it very challenging to dive immediately into any AI model application, since it is often not clear which modelling paradigm would best suit the problem at hand. This PhD research is built around the conception and validation of a heuristic data analytics methodology with the primary aim of benefiting maximally from the different facets of real-world datasets while at the same time mitigating their ‘imperfections’.
Nowadays, most of the available datasets originating from industrial activities are composed of a multitude of different parameters. The inherent multi-source nature of such datasets makes it impossible to directly integrate different data types without information loss. For instance, the performance of an industrial asset is impacted by a diverse set of factors (multiple views), such as the different operating modes and settings concerned with the internal working of the asset, and many exogenous factors, such as human operators or weather conditions. However, it is not always possible to directly link or trace back a certain performance to a distinct operating context, due to the numerous influencing factors, which are often also highly interdependent. To address this challenge, a multi-view data integration approach has been devised as part of this PhD work. It identifies and considers the different data views explicitly, making it possible to fully harness the richness of heterogeneous datasets while retaining all the relevant information.
The ongoing trend of capturing and storing ever more data goes hand in hand with an increasing complexity of interpreting it and extracting valuable insights from it. For instance, the remote monitoring of infrastructures (e.g., roads, buildings, and power supplies) or portfolios of industrial assets (e.g., wind turbines, compressors, and pumps) typically generates complex spatio-temporal data streams captured at a high sampling rate across a multitude of different locations. Combining and making sense of such data streams, while still being able to capture and preserve the temporal dynamics per spatial context, is not trivial. In this PhD research, a spatio-temporal profiling methodology is proposed, which uncovers insightful spatial patterns and dependencies while taking full advantage of the temporal dimension. However, it is crucial to acknowledge that relying solely on intelligent analysis techniques often falls short of fully uncovering pertinent patterns and relationships in real-world data. By contrast, the human eye can outperform algorithms in grasping and interpreting subtle patterns, provided it is supported by intelligent visualisations. Therefore, the domain of visual analytics research has also been explored in this PhD thesis, resulting in the conception of several novel visualisation approaches that blend advanced visualisation with intelligent analysis to effectively reveal key patterns and relationships in the dataset of interest.
By far the hardest challenge associated with the analysis of real-world datasets is the lack of ground truth, which limits the choice of learning paradigms to unsupervised ones only. Consequently, the potential of deriving meaningful insights from such datasets is far from being fully exploited, since it requires creative data science approaches beyond the mere application of AI algorithms. In this PhD research, a novel data mining and modelling framework is conceived, capable of extracting semantically interpretable states from unlabelled real-world datasets. This facilitates a better understanding of system behaviour in terms of state transitions and also allows the initially unsupervised data modelling problem to be converted into a supervised one, enabling the construction of forecasting models. Several different neural and neuro-symbolic learning workflows have been proposed for this purpose in this PhD work. Thanks to the creative data analysis phase preceding model construction, these paradigms are endowed with the capability to perform advanced supervised tasks such as modelling transition dynamics, forecasting future states, predicting forthcoming events, and identifying anomalies.
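As a purely illustrative sketch of the general idea of turning an unsupervised problem into a supervised one (not the neural or neuro-symbolic workflows proposed in the thesis), the toy example below first derives discrete states from an unlabelled one-dimensional signal with a small k-means routine, then uses those states as labels to count first-order state transitions and forecast the next state. The signal values, the function names, and the choice of k-means with a Markov-style transition table are all assumptions made for illustration.

```python
from collections import Counter, defaultdict


def kmeans_1d(values, k, iters=50):
    """Tiny deterministic 1-D k-means; returns the centroids in ascending order."""
    lo, hi = min(values), max(values)
    # Spread the initial centroids evenly over the observed value range.
    centroids = [lo + (hi - lo) * (i + 0.5) / k for i in range(k)]
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for v in values:
            nearest = min(range(k), key=lambda i: abs(v - centroids[i]))
            groups[nearest].append(v)
        # Keep the old centroid if a cluster happens to be empty.
        updated = [sum(g) / len(g) if g else centroids[i]
                   for i, g in enumerate(groups)]
        if updated == centroids:
            break
        centroids = updated
    return sorted(centroids)


def assign_states(values, centroids):
    """Label each observation with the index of its nearest centroid."""
    return [min(range(len(centroids)), key=lambda i: abs(v - centroids[i]))
            for v in values]


def fit_transitions(states):
    """Count how often each state is followed by each other state."""
    counts = defaultdict(Counter)
    for current, following in zip(states, states[1:]):
        counts[current][following] += 1
    return counts


def forecast_next(counts, state):
    """Return the most frequently observed successor of `state`."""
    if state not in counts:
        return state  # no evidence: assume the state persists
    return counts[state].most_common(1)[0][0]


# Hypothetical unlabelled signal with three visible operating regimes.
signal = [0.1, 0.2, 0.1, 5.0, 5.2, 4.9, 9.8, 10.1, 10.0,
          0.2, 0.1, 5.1, 5.0, 9.9, 10.2]

centroids = kmeans_1d(signal, k=3)          # unsupervised step: discover states
states = assign_states(signal, centroids)   # states now act as labels
model = fit_transitions(states)             # supervised step: learn transitions
print(states)
print(forecast_next(model, states[-1]))
```

The derived state sequence plays the role of the missing ground truth: once it exists, any supervised learner could replace the simple transition table used here.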
To evaluate the effectiveness of the conceived methodologies, real-world datasets from two fundamentally different application domains have been considered. The first domain relates to wind energy production, for which high-quality SCADA data collected from an onshore wind farm is leveraged. The second domain pertains to mobility, where diverse vehicle detection datasets obtained from ANPR cameras and inductive loops are utilised. These practical applications highlight the main contribution of this PhD research: the development of innovative heuristic data mining methodologies that bridge the gap between the clean and perfect benchmark datasets used in research nowadays and the reality of noisy and complex data streams originating from diverse real-world applications.
Original language: English
Awarding institution
  • Vrije Universiteit Brussel
  • Sirris, the collective centre of the technology industry
Supervisor(s)/Advisor
  • Tsiporkova, Elena, Supervisor, External person
  • Munteanu, Adrian, Supervisor
Award date: 3 Oct 2024
Status: Published - 2024
