DCEM: An R package for clustering big data via data-centric modification of Expectation Maximization

dc.authorid: KURBAN, HASAN/0000-0003-3142-2866
dc.authorid: Sharma, Parichit/0000-0003-0822-1089
dc.contributor.author: Sharma, Parichit
dc.contributor.author: Kurban, Hasan
dc.contributor.author: Dalkilic, Mehmet
dc.date.accessioned: 2024-12-24T19:27:45Z
dc.date.available: 2024-12-24T19:27:45Z
dc.date.issued: 2022
dc.department: Siirt Üniversitesi
dc.description.abstract: Clustering is intractable, so techniques exist to give a best approximation. Expectation Maximization (EM), initially used to impute missing data, is among the most popular. EM iteratively computes the parameters of a fixed number of probability distributions (PDFs), together with the probability of each datum belonging to each PDF. EM does not scale with data size, and this has hampered its current use. Using a data-centric approach, we insert hierarchical structures within the algorithm to separate high expressive data (HE) from low expressive data (LE): HE greatly affects the objective function at some iteration i, while LE does not. By alternating between HE alone and HE+LE, we significantly reduce the run-time of EM. We call this new, data-centric EM, EM*. We have designed and developed an R package called DCEM (Data Clustering with Expectation Maximization) to emphasize that data is driving the algorithm. DCEM is superior to EM as we vary size, dimensions, and separability, independent of the scientific domain. DCEM is modular and can be used as either a stand-alone program or a pluggable component. DCEM also includes our implementation of the original EM. To the best of our knowledge, there is no open source software that specifically focuses on improving EM clustering without explicit parallelization, modified seeding, or data reduction. DCEM is freely accessible on CRAN (Comprehensive R Archive Network). © 2021 The Author(s). Published by Elsevier B.V.
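The iterative computation the abstract describes can be illustrated with a minimal sketch of plain EM for a one-dimensional Gaussian mixture. This is not the DCEM package and does not implement EM* or its HE/LE separation. It assumes Gaussian components, and every name in it (`normal_pdf`, `em_gmm_1d`, `var_floor`) is hypothetical and chosen only for illustration.

```python
import math
import random


def normal_pdf(x, mean, var):
    """Density of a univariate Gaussian with the given mean and variance."""
    return math.exp(-((x - mean) ** 2) / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)


def em_gmm_1d(data, means, n_iter=50, var_floor=1e-6):
    """Plain EM for a 1-D Gaussian mixture (illustrative, not DCEM's code).

    `means` holds the initial component means; the number of components is fixed.
    """
    k, n = len(means), len(data)
    means = list(means)
    variances = [1.0] * k
    weights = [1.0 / k] * k
    resp = []
    for _ in range(n_iter):
        # E-step: probability of each datum belonging to each component.
        resp = []
        for x in data:
            joint = [weights[j] * normal_pdf(x, means[j], variances[j]) for j in range(k)]
            total = sum(joint)
            # Fall back to uniform membership if every density underflows to zero.
            resp.append([p / total for p in joint] if total > 0.0 else [1.0 / k] * k)
        # M-step: re-estimate weight, mean, and variance of each component.
        for j in range(k):
            nj = sum(r[j] for r in resp)
            if nj == 0.0:
                continue  # empty component: keep its previous parameters
            weights[j] = nj / n
            means[j] = sum(r[j] * x for r, x in zip(resp, data)) / nj
            var = sum(r[j] * (x - means[j]) ** 2 for r, x in zip(resp, data)) / nj
            variances[j] = max(var, var_floor)  # guard against collapse to zero variance
    return weights, means, variances, resp


# Synthetic data: two well-separated clusters, fixed seed for repeatability.
rng = random.Random(0)
data = [rng.gauss(0.0, 1.0) for _ in range(200)] + [rng.gauss(8.0, 1.0) for _ in range(200)]
weights, means, variances, resp = em_gmm_1d(data, means=[1.0, 6.0])
```

Every iteration of this sketch revisits every datum in both steps, which is the per-iteration cost that grows with data size.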
dc.identifier.doi: 10.1016/j.softx.2021.100944
dc.identifier.issn: 2352-7110
dc.identifier.scopus: 2-s2.0-85121962934
dc.identifier.scopusquality: Q2
dc.identifier.uri: https://doi.org/10.1016/j.softx.2021.100944
dc.identifier.uri: https://hdl.handle.net/20.500.12604/6779
dc.identifier.volume: 17
dc.identifier.wos: WOS:000769008600025
dc.identifier.wosquality: Q2
dc.indekslendigikaynak: Web of Science
dc.indekslendigikaynak: Scopus
dc.language.iso: en
dc.publisher: Elsevier
dc.relation.ispartof: SoftwareX
dc.relation.publicationcategory: Article - International Refereed Journal - Institutional Faculty Member
dc.rights: info:eu-repo/semantics/openAccess
dc.snmz: KA_20241222
dc.subject: Data centric machine learning
dc.subject: Big data
dc.subject: Unsupervised clustering
dc.subject: Expectation Maximization
dc.subject: Open source software
dc.title: DCEM: An R package for clustering big data via data-centric modification of Expectation Maximization
dc.type: Article
