Please use this identifier to cite or link to this item: https://hdl.handle.net/10955/64
Title: Effective Histogram-based Techniques for Summarizing Multi-dimensional Data
Authors: Mazzeo, Giuseppe Massimiliano
Saccà, Domenico
Talia, Domenico
Keywords: compressione dati multidimensionali
OLAP applications
Issue Date: 9-Nov-2012
Series/Report no.: SSD ING-INF/05;
Abstract: This thesis is an effort towards the definition of effective summarization techniques for multi-dimensional data. The state-of-the-art techniques are examviii Preface ined and the issues related to their inability in effectively summarizing multidimensional data are pointed out. The attention is particularly focused on histogram-based summarization techniques, which are the most flexible, the most studied and the most adopted in commercial systems. In particular, hierarchical binary partitions are studied as a basis for effective multi-dimensional histograms, focusing the attention on two aspects which turn out to be crucial for histogram accuracy: the representation model and the strategy adopted for partitioning data into buckets. As regards the former, a very specific spaceefficient representation model is proposed where bucket boundaries are represented implicitly by storing the partition tree. Histograms adopting this representation model (which will be said to be Hierarchical Binary Histograms - HBH) can store a larger number of buckets within a given amount of memory w.r.t. histograms using a “flat” explicit storage of bucket boundaries. On top of that, the introduction of a constraint on the hierarchical partition scheme is studied, allowing each bucket to be partitioned only by splits lying onto a regular grid defined on it: histograms adopting such a constrained partitioning paradigm will be said to be Grid Hierarchical Binary Histograms (GHBH). The introduction of the grid-constrained partitioning of GHBHs can be exploited to further enhance the physical representation efficiency of HBHs. As regards the construction of effective partitions, some new heuristics are introduced, guiding the data summarization by locating inhomogeneous regions of the data where a finer-grain partition is needed. The two physical representation schemes adopted by HBH and GHBH can be viewed as a form of lossless compression to be used on top of the summarization accomplished by histograms (which is a form of lossy compression). The combination of these forms of compression are shown to result in a relevant improvement of histograms effectiveness. On the one hand, the proposed compression-based representation models provide a mechanism for efficiently locating the buckets involved in query estimation, thus reducing the amount of time needed to estimate queries w.r.t. traditional flat representation models. On the other hand, applying lossless compression on top of summarization reduces the loss of information due to summarization, as it enables a larger amount of summary data to be stored within a given storage space bound: this turns out to yield lower error rates of query estimates. By means of experiments, a thorough analysis of different classes of histograms based on hierarchical partitions is provided: the accuracy provided by combining different heuristics (both the new proposals and the “classical” heuristics of two well-known techniques, namely MHIST and Min-Skew) with either the traditional MBR-based representation model or the novel specific tree-based ones (both the unconstrained and the grid-constrained one) is studied. These results provide an insight into the value of compression in the context of histograms based on hierarchical partitions. Interestingly, it is shown that the impact of both HBH and GHBH representation models on the accuracy of query estimates is not simply orthogonal to the adopted heuristic. Thus, the best combination of these different features is identified, Preface ix which turns out from adopting the grid-constrained hierarchical partitioning of GHBHs guided by one of the new heuristics. GHBH is compared with state-of-the-art techniques (MHIST, Min-Skew, GENHIST, as well as other wavelet-based summarization approaches), showing that new technique results in much lower error rates and satisfiable degree of accuracy also at high-dimensionality scenarios. Another important contribution of this thesis consists in the proposal of a new approach for constructing effective histograms. The superiority of GHBH w.r.t. the other histogram-based techniques has been found to depend primarily on the most accurate adopted criterion for guiding the data domain partitioning. In fact, traditional techniques for constructing histograms often yield partitions where dense and sparse regions are put together in the same bucket, thus yielding poor accuracy in estimating queries on summarized data. Despite GHBH adopts a criterion which in theory avoid this situation, there is an intrinsic limit in all the top-down partitioning techniques. That is, histograms which are obtained by iteratively splitting blocks by starting from the block coinciding with the whole data domain, could not have the actual possibility to reach all the dense regions in order to isolate them. In fact, each split yields an increase in the number of bucket and, as the number of buckets is bounded, the number of split that can be performed is bounded as well. Therefore, in large domains, where data are particularly skewed, the number of available splits could not be large enough to reach in a top-down split-sequence all the dense regions. Thus, it could happen that GHBH starts the partitioning of data domain following the correct direction which leads to isolating dense regions, but at a certain point the number of available buckets, and thus of available splits, is saturated. This problem could be avoided by adopting a bottom-up strategy, which first locates the dense region of the data, and then aggregates them into buckets according to some suitable strategy. The problem of searching dense regions is very close to the data clustering problem, that is the problem of grouping database objects into a set of meaningful classes. The enhancement of the histogram construction has been tried by exploiting the capability of clustering techniques to locate dense regions. A new technique, namely CHIST (Clustering-based Histograms), for constructing multidimensional histograms on the basis of a well known density-based clustering algorithm, namely DBSCAN, is proposed. CHIST algorithm first invokes DBSCAN algorithm for partitioning the data into dense and sparse regions, and then further refines this partitioning by adopting a grid-based paradigm. CHIST is compared to GHBH and it is shown to provide lower error rates, especially in “critical” settings, that is when query selectivity is particularly low or the compression ratio is very high. It is worth remarking that in these settings, experiments comparing GHBH to the other techniques showed that GHBH still provides acceptable error rates, while those provided by other techniques are completely unacceptable. CHIST is also extended to the case that data to be summarized are dynamic. In this case, the re-execution of the clustering algorithm at each data update could result prohibitive, due to the high computational cost of this task. Thus, on the basis of Incremental DBSCAN algorithm, a strategy for efficiently propagating data updates to the histogram is proposed. By means of experiments it is shown that the incremental approach, for small updates (i.e., bulk of updates 100 times smaller than the overall data size) can be computed from 100 to 200 times faster than the from-scratch re-computation of the histogram, and the accuracy remains almost unaffected.
Description: Tesi di dottorato di ricerca in Ingegneria dei sistemi ed informatica, XIX Ciclo
URI: http://hdl.handle.net/10955/64
Appears in Collections:Dipartimento di Ingegneria Informatica, Modellistica, Elettronica e Sistemistica - Tesi di Dottorato

Files in This Item:
File Description SizeFormat 
Tesi_Dottorato_Mazzeo.pdf2,51 MBAdobe PDFView/Open


Items in DSpace are protected by copyright, with all rights reserved, unless otherwise indicated.