A new content-free approach to identification of document language: Angle patterns

dc.contributor.authorNoyan, Tuba
dc.contributor.authorKuncan, Fatma
dc.contributor.authorTekin, Ramazan
dc.contributor.authorKaya, Yilmaz
dc.date.accessioned2024-12-24T19:30:25Z
dc.date.available2024-12-24T19:30:25Z
dc.date.issued2022
dc.departmentSiirt Üniversitesi
dc.description.abstractGraphical/Tabular Abstract: Language identification (LI) in text mining is the process of detecting the natural language in which a document, or part of one, is written. LI aims to reproduce with computer algorithms the human ability to recognize languages from text. It can be framed as a classification problem based on the word-level or character-level information in a document. The LI literature uses both linguistic and statistical approaches. Linguistic methods identify a language from words or characters specific to it and are applied through language-specific rules. Statistical methods rely on the frequency and distribution of the words or characters that make up a language. They are content-independent: they do not consider the semantic content of the text and, compared with linguistic methods, provide little information about it. The model proposed in this study is a statistical approach. (Figure A. Proposed block diagram for LI.) Purpose: This study proposes a new LI approach that uses the angle information between the UTF-8 values of the characters in a text. The proposed angle pattern method is used to extract features from texts. It has two distance parameters, L and R, which set how far to the left and right of the reference point the neighborhood extends. Theory and Methods: Four datasets were used to test the proposed approach, two created by the authors and two publicly available on the Internet.
The features obtained with the angle pattern method were classified with several machine learning methods: Random Forest, Support Vector Machine, Linear Discriminant Analysis, Naive Bayes, and K-nearest neighbor. The language identification performance on the four datasets was 96.81%, 99.39%, 93.31%, and 98.60%, respectively. Results: The results indicate that the proposed angle pattern method provides important discriminative information for language identification. The authors suggest that the approach could also be used in other text mining applications, such as spam recognition and text categorization.
dc.identifier.doi10.17341/gazimmfd.844700
dc.identifier.endpage1292
dc.identifier.issn1300-1884
dc.identifier.issn1304-4915
dc.identifier.issue3
dc.identifier.scopus2-s2.0-85128730682
dc.identifier.scopusqualityQ2
dc.identifier.startpage1277
dc.identifier.trdizinid508635
dc.identifier.urihttps://doi.org/10.17341/gazimmfd.844700
dc.identifier.urihttps://search.trdizin.gov.tr/tr/yayin/detay/508635
dc.identifier.urihttps://hdl.handle.net/20.500.12604/7524
dc.identifier.volume37
dc.identifier.wosWOS:000834843300012
dc.identifier.wosqualityQ4
dc.indekslendigikaynakWeb of Science
dc.indekslendigikaynakScopus
dc.indekslendigikaynakTR-Dizin
dc.language.isoen
dc.publisherGazi Univ, Fac Engineering Architecture
dc.relation.ispartofJournal of The Faculty of Engineering and Architecture of Gazi University
dc.relation.publicationcategoryArticle - International Peer-Reviewed Journal - Institutional Faculty Member
dc.rightsinfo:eu-repo/semantics/openAccess
dc.snmzKA_20241222
dc.subjectText-based language identification
dc.subjectNatural language processing
dc.subjectAngle patterns
dc.subjectFeature extraction
dc.titleA new content-free approach to identification of document language: Angle patterns
dc.title.alternativeDoküman dili tanıma için içerik bağımsız yeni bir yaklaşım: Açı örüntüler
dc.typeArticle
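
Illustrative note on the abstract: the record does not give the exact angle definition, so the following is only a minimal sketch of what an angle-based feature extractor with left and right distance parameters might look like. Everything here is an assumption made for illustration: the function name `angle_pattern_histogram`, the use of Unicode code points (`ord`) in place of the "UTF-8 values" mentioned in the abstract, the treatment of each character as a point (position, code point), and the histogram binning. It should not be read as the authors' method.

```python
import math


def angle_pattern_histogram(text, left=1, right=1, bins=18):
    """Return a normalized histogram of angles in degrees (0..180).

    Assumption: each character is the point (index, code point), and the
    angle is measured at the reference character between the vectors to
    the neighbor `left` positions before it and `right` positions after it.
    """
    # Assumption: code points stand in for the abstract's "UTF-8 values".
    codes = [ord(ch) for ch in text]
    hist = [0] * bins
    for i in range(left, len(codes) - right):
        # Vectors from the reference point to its left and right neighbors.
        v1 = (-left, codes[i - left] - codes[i])
        v2 = (right, codes[i + right] - codes[i])
        # atan2(|cross|, dot) yields the unsigned angle in [0, pi].
        cross = v1[0] * v2[1] - v1[1] * v2[0]
        dot = v1[0] * v2[0] + v1[1] * v2[1]
        angle = math.degrees(math.atan2(abs(cross), dot))
        # Clamp so that an angle of exactly 180 falls in the last bin.
        hist[min(int(angle / 180 * bins), bins - 1)] += 1
    total = sum(hist)
    return [h / total for h in hist] if total else [0.0] * bins


features = angle_pattern_histogram("Language identification", left=1, right=1)
```

Under these assumptions, each text yields a fixed-length vector that could be passed to any of the classifiers named in the abstract; varying `left` and `right` changes which neighbors define the angle.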
