Overcoming uncertainty and the curse of dimensionality in data clustering
Mostra/ Apri
Creato da
Gullo,Francesco
Greco,Sergio
Palopoli,Luigi
Metadata
Mostra tutti i dati dell'itemDescrizione
Formato
/
Dottorato di Ricerca in Ingegneria dei Sistemi ed Informatica, XXII Ciclo, 2009; Uncertainty and the curse of dimensionality are two crucial problems that
usually a®ect data clustering.
Uncertainty in data clustering may be typically considered at data level
or clustering level. Data level uncertainty is inherently present in the repre-
sentation of several kinds of data objects from various application contexts
(e.g., sensor networks, moving objects databases, biomedicine). This kind of
uncertainty should be carefully taken into account in a clustering task in order
to achieve adequate accuracy; unfortunately, traditional clustering methods
are designed to work only on deterministic vectorial representations of data
objects. Clustering uncertainty is related to the output of any clustering algo-
rithm. Indeed, the ill-posed nature of clustering leads to clustering algorithms
that cannot be generally valid for any input dataset, i.e., their output results
are necessarily uncertain. Clustering ensembles has been recognized as a pow-
erful solution to overcome clustering uncertainty. It aims to derive a single
clustering solution (i.e., the consensus partition) from a set of clusterings rep-
resenting di®erent partitions of the same input dataset (i.e., the ensemble).
A major weakness of the existing clustering ensembles methods is that they
compute the consensus partition by equally considering all the solutions in
the ensemble.
The curse of dimensionality in data clustering concerns all the issues that
naturally arise from data objects represented by a large set of features and are
responsible of poor accuracy and e±ciency achieved by traditional clustering
methods working on high dimensional data. Classic approaches to the curse
of dimensionality include global and local dimensionality reduction. Global
techniques aim at reducing the dimensionality of the input dataset by applying
the same algorithm(s) to all the input data objects. Local dimensionality
reduction acts by considering subsets of the input dataset and performing
dimensionality reduction speci¯c for any of such subsets. Projective clustering
is an e®ective class of methods falling into the category of local dimensionality
reduction. It aims to discover clusters of objects along with the corresponding
subspaces, following the principle that objects in the same cluster are close to
each other if and only if they are projected onto the subspace associated to
that cluster.
viii Abstract
The focus of this thesis is on the development of proper techniques for
overcoming the crucial problems of uncertainty and the curse of dimension-
ality arising from data clustering. This thesis provides the following main
contributions.
Uncertainty. Uncertainty at a representation level is addressed by proposing:
UK-medoids, which is a new partitional algorithm for clustering un-
certain objects, which is designed to overcome e±ciency and accuracy
issues of some existing state-of-the-art methods;
U-AHC, i.e., the ¯rst (agglomerative) hierarchical algorithm for clus-
tering uncertain objects;
a methodology to exploit U-AHC for clustering microarray biomedical
data with probe-level uncertainty.
Clustering uncertainty is addressed by focusing on the problem of weighted
consensus clustering, which aims to automatically determine weighting
schemes to discriminate among clustering solutions in a given ensemble.
In particular:
three novel diversity-based, general schemes for weighting the individ-
ual clusterings in a given ensemble are proposed, i.e., Single-Weighting
(SW), Group-Weighting (GW), and Dendrogram-Weighting (DW);
three algorithms, called WICE, WCCE, and WHCE, are de¯ned to eas-
ily involve clustering weighting schemes into any clustering ensembles
algorithm falling into one of the main classes of clustering ensembles
approaches, i.e., instance-based, cluster-based, and hybrid.
The curse of dimensionality. Global dimensionality reduction is addressed
by focusing on the time series data application context:
the Derivative time series Segment Approximation (DSA) model is
proposed as a new time series dimensionality reduction method de-
signed for accurate and fast similarity detection and clustering;
Mass Spectrometry Data Analysis (MaSDA) system is presented; it
mainly aims at analyzing mass spectrometry (MS) biomedical data by
exploiting DSA to model such data according to a time series-based
representation;
DSA is exploited for pro¯ling low-voltage electricity customers.
Regarding local dimensionality reduction, a uni¯ed view of projective clus-
tering and clustering ensembles is provided. In particular:
the novel Projective Clustering Ensembles (PCE) problem is addressed
and formally de¯ned according two speci¯c optimization formulations,
i.e., two-objective PCE and single-objective PCE;
MOEA-PCE and EM-PCE algorithms are proposed as novel heuristics
to solve two-objective PCE and single-objective PCE, respectively.
Absolute accuracy and e±ciency performance achieved by the proposed
techniques, as well as the performance with respect to the prominent state-
of-the-art methods are evaluated by performing extensive sets of experiments
on benchmark, synthetically generated, and real-world datasets; Università della CalabriaSoggetto
Ingegneria dei sistemi; Sincronizzazione; Estrazione dati
Relazione
ING-INF/05;