A new content-free approach to identification of document language: Angle patterns
dc.contributor.author | Noyan, Tuba | |
dc.contributor.author | Kuncan, Fatma | |
dc.contributor.author | Tekin, Ramazan | |
dc.contributor.author | Kaya, Yilmaz | |
dc.date.accessioned | 2024-12-24T19:30:25Z | |
dc.date.available | 2024-12-24T19:30:25Z | |
dc.date.issued | 2022 | |
dc.department | Siirt Üniversitesi | |
dc.description.abstract | Graphical/Tabular Abstract: Language identification (LI) in text mining is the process of detecting the natural language in which a document, or part of it, is written. LI aims to reproduce with computer algorithms the human ability to recognize languages from text. LI can be framed as a classification problem based on word-level or character-level information in a document. The LI literature uses both linguistic and statistical approaches. Linguistic methods identify a language from its characteristic words or characters and are applied according to language-specific rules. Statistical methods rely on the frequency and distribution of the words or characters that make up a language. They are content-independent: they do not consider the semantic content of the text and, compared with linguistic methods, provide little information about it. The model proposed in this study is a statistical approach. Figure A. Proposed block diagram for LI. Purpose: This study proposes a new LI approach that uses the angle information between the UTF-8 values of the characters in a text. The proposed angle pattern method, a statistical approach, is used to extract features from texts. It has two distance parameters, L and R, which set how far to the left and right of the reference point the neighborhood extends. Theory and Methods: To test the proposed approach, four datasets were used: two created by the authors and two publicly available on the Internet. 
The features obtained with the angle pattern method were classified with several machine learning methods: Random Forest, Support Vector Machine, Linear Discriminant Analysis, Naive Bayes, and K-nearest neighbor. The language identification performance on the four datasets was 96.81%, 99.39%, 93.31%, and 98.60%, respectively. Results: These results indicate that the proposed angle pattern method provides important discriminative information for language identification. The authors suggest that the approach could also be used in other text mining applications, such as spam recognition and text categorization. | |
dc.identifier.doi | 10.17341/gazimmfd.844700 | |
dc.identifier.endpage | 1292 | |
dc.identifier.issn | 1300-1884 | |
dc.identifier.issn | 1304-4915 | |
dc.identifier.issue | 3 | |
dc.identifier.scopus | 2-s2.0-85128730682 | |
dc.identifier.scopusquality | Q2 | |
dc.identifier.startpage | 1277 | |
dc.identifier.trdizinid | 508635 | |
dc.identifier.uri | https://doi.org/10.17341/gazimmfd.844700 | |
dc.identifier.uri | https://search.trdizin.gov.tr/tr/yayin/detay/508635 | |
dc.identifier.uri | https://hdl.handle.net/20.500.12604/7524 | |
dc.identifier.volume | 37 | |
dc.identifier.wos | WOS:000834843300012 | |
dc.identifier.wosquality | Q4 | |
dc.indekslendigikaynak | Web of Science | |
dc.indekslendigikaynak | Scopus | |
dc.indekslendigikaynak | TR-Dizin | |
dc.language.iso | en | |
dc.publisher | Gazi Univ, Fac Engineering Architecture | |
dc.relation.ispartof | Journal of The Faculty of Engineering and Architecture of Gazi University | |
dc.relation.publicationcategory | Article - International Refereed Journal - Institutional Faculty Member | |
dc.rights | info:eu-repo/semantics/openAccess | |
dc.snmz | KA_20241222 | |
dc.subject | Text-based language identification | |
dc.subject | Natural language processing | |
dc.subject | Angle patterns | |
dc.subject | Feature extraction | |
dc.title | A new content-free approach to identification of document language: Angle patterns | |
dc.title.alternative | Doküman dili tanıma için içerik bağımsız yeni bir yaklaşım: Açı örüntüler | |
dc.type | Article |
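
The abstract describes the angle pattern method only in outline, so the following is a minimal sketch of one possible reading, not the authors' implementation. It assumes that each character is mapped to the integer value of its UTF-8 encoding, that each reference character is treated as a point (position, value) in the plane, and that the feature is the angle at that point between the neighbor L positions to the left and the neighbor R positions to the right. The function names, the 18-bin histogram, and the normalization are all hypothetical choices made for illustration.

```python
import math


def utf8_values(text):
    # Assumption: a character's "UTF-8 value" is its encoded bytes read as one integer.
    return [int.from_bytes(ch.encode("utf-8"), "big") for ch in text]


def angle_patterns(text, L=1, R=1):
    """Angle in degrees (0..180) at each reference character.

    Assumed geometry: the points are (i - L, v[i - L]), (i, v[i]) and
    (i + R, v[i + R]); the angle is measured at the middle point.
    """
    v = utf8_values(text)
    angles = []
    for i in range(L, len(v) - R):
        ax, ay = -L, v[i - L] - v[i]  # vector from reference point to left neighbor
        bx, by = R, v[i + R] - v[i]   # vector from reference point to right neighbor
        dot = ax * bx + ay * by
        cross = ax * by - ay * bx
        angles.append(math.degrees(math.atan2(abs(cross), dot)))
    return angles


def angle_histogram(text, L=1, R=1, bins=18):
    """Normalized histogram of the angles, usable as a fixed-length feature vector."""
    hist = [0.0] * bins
    angles = angle_patterns(text, L, R)
    width = 180.0 / bins
    for a in angles:
        hist[min(int(a / width), bins - 1)] += 1.0
    if angles:
        hist = [h / len(angles) for h in hist]
    return hist


if __name__ == "__main__":
    print(angle_histogram("Language identification", L=1, R=1))
```

Under these assumptions, a feature vector like the one returned by `angle_histogram` could be passed to any of the classifiers named in the abstract. Different L and R values look at wider or narrower neighborhoods around the reference character.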