Advances in mining complex data: modeling and clustering
Created by
Ponti,Giovanni
Greco,Sergio
Palopoli,Luigi
Description
PhD Program in Systems and Computer Engineering (Dottorato di Ricerca in
Ingegneria dei Sistemi ed Informatica), XXII Cycle, 2009.
In recent years, a large amount of data has been produced in different
application contexts. However, although technological progress provides
several facilities to digitally encode any type of event, it is important
to define a suitable representation model that captures the main
characteristics of the data. This aspect is particularly relevant in fields
and contexts where the data to be archived cannot be represented in a fixed
structured scheme or described by simple numerical values. We hereinafter
refer to such data as complex data.
Although it is important to define ad hoc representation models for complex
data, it is also crucial to have analysis systems and data exploration
techniques. Analysts and system users need new instruments that support them
in extracting the patterns and relations hidden in the data. The entire
process that aims to extract useful information and knowledge from raw data
is called Knowledge Discovery in Databases (KDD). It consists of a set of
specific phases that transform and manage the data to produce models and
knowledge. Many knowledge extraction techniques exist for traditional
structured data, but they are not suitable for handling complex data.
The main challenges addressed by this thesis are investigating and solving
representation problems for complex data, and defining proper algorithms and
techniques to extract models, patterns, and new information from such data
effectively and efficiently. In particular, two main aspects of complex data
management have been investigated: the way in which complex data can be
modeled (i.e., data modeling), and the way in which homogeneous groups within
complex data can be identified (i.e., data clustering). The application
contexts studied are time series data, uncertain data, text data, and
biomedical data.
The research contributions of this thesis can be divided into four main
parts, each of which concerns one specific area and data type:
Time Series: A time series representation model has been developed,
which is conceived to support accurate and fast similarity detection. This
model is called Derivative time series Segment Approximation (DSA), as
it achieves a concise yet feature-rich time series representation by com-
bining the notions of derivative estimation, segmentation and segment
approximation.
Uncertain Data: Research in uncertain data mining followed two directions.
In a first phase, a new proposal for partitional clustering has been defined
by introducing the Uncertain K-medoids (UK-medoids) algorithm. This approach
provides a more accurate way to handle uncertain objects in a clustering
task, since a cluster representative is an uncertain object itself (and not
a deterministic one). In addition, the efficiency issue has been addressed
by defining a distance function between uncertain objects that can be
calculated offline once per dataset.
In a second phase, research activities investigated issues related to
hierarchical clustering of uncertain data. To this end, an agglomerative
centroid-based linkage hierarchical clustering framework for uncertain data
(U-AHC) has been proposed. The key point lies in equipping this scheme with
a more accurate distance measure for uncertain objects. Specifically,
information theory has been used to find a measure able to compare the
probability distributions that model the uncertainty of the objects.
Text Data: Research results on text data can be summarized in two main
contributions. The first one concerns clustering of multi-topic documents:
a framework for hard clustering of documents according to their mixtures of
topics has been proposed. Documents are assumed to be modeled by a
generative process, which provides a mixture of probability mass functions
(pmfs) to model the topics that are discussed within any specific document.
The framework combines the expressiveness of generative models for document
representation with a properly chosen information-theoretic distance measure
to group the documents.
The second proposal concerns distributional clustering of XML documents,
focusing on the development of a distributed framework for efficiently
clustering XML documents. The distributed environment consists of a
peer-to-peer network where each node has access to a portion of the whole
document collection and communicates with all the other nodes to perform a
clustering task in a collaborative fashion. The proposed framework models
and clusters XML documents by structure and content. Specifically, XML
documents are transformed into transactional data based on the notion of
tree tuple. The distributed, transactional clustering algorithm follows the
well-known paradigm of centroid-based partitional clustering.
Biomedical Data: Research results on time series and uncertain data have
been applied to support effective and efficient biomedical data management.
The focus was on both proteomics and genomics, investigating Mass
Spectrometry (MS) data and microarray data. Specifically, a Mass
Spectrometry Data Analysis (MaSDA) system has been defined. The key idea
consists in exploiting temporal information implicitly contained in MS data
and modeling such data as time series. The major advantages of this solution
are dimensionality reduction and noise reduction. As regards microarray
data, U-AHC has been employed to perform clustering of microarray data with
probe-level uncertainty. A strategy to model probe-level uncertainty has
been defined, together with a hierarchical clustering scheme for analyzing
such data. This approach performs a gene-based clustering to discover
clustering solutions that are well-suited to capture the underlying
gene-based patterns of microarray data.
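To make the UK-medoids idea from the Uncertain Data part more concrete, the
following is a minimal illustrative sketch, not the algorithm as defined in
the thesis. It assumes that each uncertain object is given as a finite list
of sample points, and it assumes the expected squared Euclidean distance
over all sample pairs as the distance between two uncertain objects; the
abstract does not specify either choice. All function names are
hypothetical. The sketch shows the two properties the abstract mentions:
the pairwise distances are computed once up front, and each cluster
representative is one of the uncertain objects.

```python
def expected_sq_distance(a, b):
    """Mean squared Euclidean distance over all sample pairs of two
    uncertain objects (an assumed distance, chosen for illustration)."""
    total = 0.0
    for p in a:
        for q in b:
            total += sum((x - y) ** 2 for x, y in zip(p, q))
    return total / (len(a) * len(b))


def pairwise_distances(objects):
    """Compute the full distance matrix once, before clustering starts.
    The diagonal is left at 0.0 by convention."""
    n = len(objects)
    dist = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            dist[i][j] = dist[j][i] = expected_sq_distance(objects[i], objects[j])
    return dist


def assign(dist, medoids):
    """Label each object with the position of its nearest medoid."""
    return [
        min(range(len(medoids)), key=lambda c: dist[i][medoids[c]])
        for i in range(len(dist))
    ]


def k_medoids(dist, initial_medoids, max_iter=100):
    """Alternate between assignment and medoid update until the medoids
    stop changing. Medoids are indices of uncertain objects, so every
    cluster representative is itself an uncertain object."""
    medoids = list(initial_medoids)
    for _ in range(max_iter):
        labels = assign(dist, medoids)
        updated = list(medoids)
        for c in range(len(medoids)):
            members = [i for i, lab in enumerate(labels) if lab == c]
            if members:
                # New medoid: the member with the smallest total distance
                # to the other members of its cluster.
                updated[c] = min(
                    members, key=lambda m: sum(dist[m][j] for j in members)
                )
        if updated == medoids:
            break
        medoids = updated
    return medoids, assign(dist, medoids)
```

A caller would build the matrix once with `pairwise_distances(objects)` and
then run `k_medoids(dist, initial_medoids)` as many times as needed, for
example with different starting medoids, without touching the samples again.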
The effectiveness and efficiency of the proposed techniques in clustering
complex data are demonstrated through extensive experiments, in which the
proposals are compared with the main state-of-the-art competitors.
Università della Calabria
Subject
Computer engineering; Cluster analysis
Relation
ING-INF/05;