A novel feature extraction approach for text-based language identification: Binary patterns

Date

2016

Publisher

Gazi Universitesi Muhendislik-Mimarlik

Access Rights

info:eu-repo/semantics/closedAccess

Abstract

Language identification (LI), a major task in natural language processing, is the process of determining the language of a given content. This paper proposes a novel approach based on the usage probability of characters that have similar orders with respect to their UTF-8 values. To evaluate and validate the proposed approach, four datasets containing texts in different numbers of languages were employed. In the proposed approach, features extracted by the one-dimensional local binary pattern (1D-LBP) method were classified by various machine learning methods. The LI accuracies achieved on the four datasets were 86.20%, 92.75%, 100%, and 89.77%, respectively. The results show that the proposed approach yields high success rates and is an efficient way of identifying languages.
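
As a rough illustration of the kind of feature extraction the abstract describes, the following Python sketch computes 1D-LBP codes over a string and turns them into a normalized histogram. It is not the authors' implementation: the use of `ord()` code points as a stand-in for "UTF-8 values", the window size `half_window`, the `>=` comparison, the bit ordering, and the function names `lbp_1d_codes` and `lbp_1d_histogram` are all assumptions made for this example.

```python
def lbp_1d_codes(text, half_window=4):
    """Return one 1D-LBP code per character that has a full neighborhood.

    Assumption: each character is represented by its Unicode code point
    (ord), used here as a stand-in for the "UTF-8 value" in the abstract.
    """
    values = [ord(ch) for ch in text]
    codes = []
    for i in range(half_window, len(values) - half_window):
        center = values[i]
        # Neighbors: half_window characters on each side, center excluded.
        neighbors = values[i - half_window:i] + values[i + 1:i + 1 + half_window]
        code = 0
        for bit, value in enumerate(neighbors):
            # Assumed thresholding rule: bit is 1 when neighbor >= center.
            if value >= center:
                code |= 1 << bit
        codes.append(code)
    return codes


def lbp_1d_histogram(text, half_window=4):
    """Return the normalized histogram of 1D-LBP codes as a feature vector."""
    n_bins = 1 << (2 * half_window)  # 2 * half_window neighbors -> 2^P codes
    codes = lbp_1d_codes(text, half_window)
    hist = [0.0] * n_bins
    for code in codes:
        hist[code] += 1.0
    if codes:
        hist = [count / len(codes) for count in hist]
    return hist


if __name__ == "__main__":
    features = lbp_1d_histogram("Language identification is a major task.")
    print(len(features), sum(features))
```

With `half_window=4`, each text yields a 256-dimensional vector that could then be passed to a classifier of choice; the abstract does not state which window size or classifiers performed best.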

Keywords

Feature extraction, Natural language processing, One dimensional local binary patterns, Text-based language identification

Journal or Series

Journal of the Faculty of Engineering and Architecture of Gazi University

Scopus Q Value

Q2

Volume

31

Issue

4
