A new content-free approach to identification of document language: Angle patterns

Noyan, Tuba; Kuncan, Fatma; Tekin, Ramazan; Kaya, Yilmaz

dc.contributor.author	Noyan, Tuba
dc.contributor.author	Kuncan, Fatma
dc.contributor.author	Tekin, Ramazan
dc.contributor.author	Kaya, Yilmaz
dc.date.accessioned	2024-12-24T19:30:25Z
dc.date.available	2024-12-24T19:30:25Z
dc.date.issued	2022
dc.department	Siirt Üniversitesi
dc.description.abstract	Graphical/Tabular Abstract. Language identification (LI) in text mining is the process of detecting the natural language in which a document, or part of it, is written. LI aims to mimic, with computer algorithms, a human's ability to recognize languages from text. It can be framed as a classification problem based on word-level or character-level information in a document. The LI literature uses both linguistic and statistical approaches. Linguistic methods identify a language from its distinctive words or characters and are applied according to language-specific rules. Statistical methods rely on the frequency and distribution of the words or characters that make up a language. They are content-independent: they do not consider the semantic content of the text and, compared with linguistic methods, provide little information about it. The model proposed in this study is a statistical approach. [Figure A. Proposed block diagram for LI] Purpose: This study proposes a new LI approach that uses the angle information between the UTF-8 values of the characters in a text. The proposed angle pattern method, a statistical approach, is used for feature extraction from texts. It has two distance parameters, R and L, which specify how far to the right and left of the reference point the neighboring characters are taken. Theory and Methods: The proposed approach was tested on four datasets, two created by the authors and two publicly available on the Internet. The features obtained with the angle pattern method were classified with different machine learning methods: Random Forest, Support Vector Machine, Linear Discriminant Analysis, Naive Bayes, and K-nearest neighbor. The language identification performance on the four datasets was 96.81%, 99.39%, 93.31%, and 98.60%, respectively. Results: The results indicate that the proposed angle pattern method provides important discriminative information for language identification. The authors suggest that the approach could also be used in other text mining applications, such as spam recognition and text categorization.
dc.identifier.doi	10.17341/gazimmfd.844700
dc.identifier.endpage	1292
dc.identifier.issn	1300-1884
dc.identifier.issn	1304-4915
dc.identifier.issue	3
dc.identifier.scopus	2-s2.0-85128730682
dc.identifier.scopusquality	Q2
dc.identifier.startpage	1277
dc.identifier.trdizinid	508635
dc.identifier.uri	https://doi.org/10.17341/gazimmfd.844700
dc.identifier.uri	https://search.trdizin.gov.tr/tr/yayin/detay/508635
dc.identifier.uri	https://hdl.handle.net/20.500.12604/7524
dc.identifier.volume	37
dc.identifier.wos	WOS:000834843300012
dc.identifier.wosquality	Q4
dc.indekslendigikaynak	Web of Science
dc.indekslendigikaynak	Scopus
dc.indekslendigikaynak	TR-Dizin
dc.language.iso	en
dc.publisher	Gazi Univ, Fac Engineering Architecture
dc.relation.ispartof	Journal of The Faculty of Engineering and Architecture of Gazi University
dc.relation.publicationcategory	Article - International Refereed Journal - Institutional Academic Staff
dc.rights	info:eu-repo/semantics/openAccess
dc.snmz	KA_20241222
dc.subject	Text-based language identification
dc.subject	Natural language processing
dc.subject	Angle patterns
dc.subject	Feature extraction
dc.title	A new content-free approach to identification of document language: Angle patterns
dc.title.alternative	Doküman dili tanıma için içerik bağımsız yeni bir yaklaşım: Açı örüntüler
dc.type	Article
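
Illustrative sketch (not part of the record): the abstract describes features built from the angle between the UTF-8 values of a reference character and its neighbors at distances L and R, but it does not give the formula. The Python sketch below shows one possible reading, under assumptions that are not confirmed by the source: characters are mapped to Unicode code points with ord() rather than to UTF-8 byte values, each character is treated as a point (position, code), the angle is measured at the reference point between the vectors to its left and right neighbors, and the angles are summarized as a normalized histogram. The function name, the bins parameter, and the histogram step are hypothetical.

```python
import math


def angle_pattern_histogram(text, left=1, right=1, bins=18):
    """Return a normalized histogram of angles (0-180 degrees) over a text.

    Assumption: this is an illustrative interpretation of "angle patterns",
    not the authors' published definition.
    """
    # Assumption: Unicode code points stand in for the "UTF-8 values".
    codes = [ord(ch) for ch in text]
    hist = [0] * bins

    # Visit every reference position that has both neighbors available.
    for i in range(left, len(codes) - right):
        # Vector from the reference point to the left neighbor.
        ax, ay = -left, codes[i - left] - codes[i]
        # Vector from the reference point to the right neighbor.
        bx, by = right, codes[i + right] - codes[i]

        dot = ax * bx + ay * by
        cross = ax * by - ay * bx
        # Unsigned angle between the two vectors, in [0, 180] degrees.
        angle = math.degrees(math.atan2(abs(cross), dot))

        # Map the angle to a bin; 180 degrees falls into the last bin.
        index = min(int(angle / 180.0 * bins), bins - 1)
        hist[index] += 1

    total = sum(hist)
    if total == 0:
        return [0.0] * bins
    return [count / total for count in hist]


if __name__ == "__main__":
    features = angle_pattern_histogram("Merhaba dünya", left=1, right=1)
    print(features)
```

A feature vector of this kind could then be passed to any of the classifiers named in the abstract (for example a random forest); that step is omitted here to keep the sketch dependency-free.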

Collection

WOS Indexed Publications Collection
Scopus Indexed Publications Collection
TR-Dizin Indexed Publications Collection
