Author: C. Deepa
Date of Publication: 22nd February 2021
Abstract: The rapid growth of large-scale, high-dimensional electronic documents is challenging because the data is unstructured, and clustering such documents takes considerable time and effort in many domains. Many clustering algorithms based on machine learning have been developed for data with either a very large sample size or a very high number of dimensions, but they are often impractical when the data is large in both respects, which leads to the curse of dimensionality, data noise, data sparsity, and scalability issues that degrade effectiveness and efficiency. Data transformation using heuristic and hybrid techniques has been proposed to handle categorical and numeric attributes simultaneously and to scale well with both the dimensionality and the size of the data, based on the distance between data points. This paper presents an extensive study of machine learning techniques that employ hybrid and ensemble models to handle large-scale, high-dimensional data, examining in detail data pre-processing, dimensionality reduction, feature selection, feature extraction, and finally clustering. Clustering algorithms are broadly classified as partition-based, kernel-based, hierarchical, density-based, and subspace clustering. These analyses point to solutions for fast data-space reduction and intelligent sampling that cluster the data effectively under various objective functions and optimal solution configurations, alleviating the issues mentioned above. An experimental analysis of machine-learning-based data clustering models under multiple settings has been carried out on various data sets using performance metrics such as Euclidean distance, accuracy, execution time, and silhouette index.
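As an illustration of two of the metrics named in the abstract, the following is a minimal sketch of how a silhouette index might be computed from Euclidean distances. It is not taken from the paper: the function names `euclidean` and `silhouette_index`, the toy points, and the two labelings are all assumptions chosen for illustration, and the paper's own experimental setup may differ.

```python
import math


def euclidean(p, q):
    """Euclidean distance between two equal-length numeric points."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def silhouette_index(points, labels):
    """Mean silhouette coefficient over all points (hypothetical helper).

    For each point: a = mean distance to the other members of its own
    cluster, b = smallest mean distance to the members of any other
    cluster, and the coefficient is (b - a) / max(a, b).
    """
    clusters = {}
    for idx, label in enumerate(labels):
        clusters.setdefault(label, []).append(idx)
    if len(clusters) < 2:
        raise ValueError("silhouette index needs at least two clusters")

    scores = []
    for i, label in enumerate(labels):
        own = [j for j in clusters[label] if j != i]
        if not own:
            # Common convention: a singleton cluster scores 0.
            scores.append(0.0)
            continue
        a = sum(euclidean(points[i], points[j]) for j in own) / len(own)
        b = min(
            sum(euclidean(points[i], points[j]) for j in members) / len(members)
            for other, members in clusters.items()
            if other != label
        )
        denom = max(a, b)
        scores.append(0.0 if denom == 0 else (b - a) / denom)
    return sum(scores) / len(scores)


# Toy data (assumed): two tight groups that lie far apart.
points = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
good = silhouette_index(points, [0, 0, 1, 1])  # labels follow the groups
bad = silhouette_index(points, [0, 1, 0, 1])   # labels cut across the groups
```

On this toy data the labeling that follows the two groups scores close to 1 and the labeling that cuts across them scores below 0, which is the behaviour that makes the index usable for comparing clustering results. The pairwise computation is quadratic in the number of points, so at the data sizes the abstract has in mind it would likely be applied to a sample.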