Tutorial: Clustering Techniques for Large Data Sets

From the Past to the Future

Authors: Alexander Hinneburg, Daniel A. Keim

Abstract: Because of fast technological progress, the amount of information stored in databases is increasing rapidly. In addition, new applications require the storage and retrieval of complex multimedia objects, which are often represented by high-dimensional feature vectors. Finding the valuable information hidden in those databases is a difficult task. Cluster analysis is one of the basic techniques often applied in analyzing large data sets. Originating from the area of statistics, most cluster analysis algorithms were originally developed for relatively small data sets. In recent years, clustering algorithms have been extended to work efficiently on large data sets, and some of them even allow the clustering of high-dimensional feature vectors. Many such methods use some kind of index structure for efficient retrieval of the required data; other approaches rely on preprocessing to make the clustering more efficient.

The main goal of the tutorial is to provide an overview of the state of the art in cluster discovery methods for large databases, covering well-known clustering methods from related fields such as statistics, pattern recognition, and machine learning, as well as database techniques that allow them to work efficiently on large databases. The target audience consists of researchers and practitioners from statistics, databases, and machine learning who are interested in the state of the art of cluster discovery methods and their applications to large databases. The tutorial especially addresses people from academia who are interested in developing new cluster discovery algorithms, and people from industry who want to apply cluster discovery methods in analyzing large databases.

Tutorial Notes, pdf (8 MB), pdf.gz (3 MB)

Collection of interesting papers.

Alexander Hinneburg: 25. Sept. 2000