Two new feature selection metrics for text classification


Şahin D. Ö., Kılıç E.

AUTOMATIKA, vol.60, no.2, pp.162-171, 2019 (SCI-Expanded)

  • Publication Type: Article / Full Article
  • Volume: 60 Issue: 2
  • Publication Date: 2019
  • DOI: 10.1080/00051144.2019.1602293
  • Journal Name: AUTOMATIKA
  • Journal Indexes: Science Citation Index Expanded (SCI-EXPANDED), Scopus
  • Pages: pp.162-171
  • Keywords: Text classification, text mining, feature selection, term selection, STATISTICAL INTERPRETATION, TERM SPECIFICITY, CATEGORIZATION, RECOGNITION, ALGORITHM
  • Affiliated with Ondokuz Mayıs University: Yes

Abstract

Extracting meaningful information from data has become a central problem, and data mining techniques have gained importance as a result. Text classification is one of the most widely studied areas of data mining. Its main difficulty is that large data sizes increase the required processing time and reduce classification success. The main purpose of this study is to determine suitable feature selection methods for text classification. Frequently used feature selection metrics such as Chi-square and Information Gain were applied to different data sets, and their performance was measured. In addition, two filter-based feature selection metrics are proposed as alternatives to the existing ones. The first is the Relevance Frequency Feature Selection metric, obtained by adding new parameters to the Relevance Frequency method used for term weighting in text classification. The second is the Alternative Accuracy2 metric, obtained by changing the parameters of the Accuracy2 metric. The proposed Relevance Frequency Feature Selection and Alternative Accuracy2 metrics were observed to yield results as successful as those of the frequently used existing metrics.
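
To make the idea of a filter-based metric concrete, the following is a minimal sketch of how terms might be scored and ranked for a two-class problem. It uses the commonly cited textbook forms of Chi-square and of the baseline Accuracy2 metric (the absolute difference between the true positive and false positive rates); it does not reproduce the Relevance Frequency Feature Selection or Alternative Accuracy2 metrics proposed in the paper, whose formulas are not given in this abstract. The function names, the whitespace tokenization, and the alphabetical tie-breaking are assumptions made for illustration.

```python
from collections import Counter


def term_counts(docs, labels, positive):
    """Count, per term, the documents containing it in each class.

    Returns (tp, fp, n_pos, n_neg): tp[t] is the number of positive
    documents containing term t, fp[t] the number of negative ones.
    Tokenization is a plain lowercase whitespace split (an assumption).
    """
    tp, fp = Counter(), Counter()
    n_pos = n_neg = 0
    for text, label in zip(docs, labels):
        terms = set(text.lower().split())
        if label == positive:
            n_pos += 1
            tp.update(terms)
        else:
            n_neg += 1
            fp.update(terms)
    return tp, fp, n_pos, n_neg


def chi_square(tp, fp, n_pos, n_neg):
    """Chi-square statistic of the 2x2 term/class contingency table."""
    fn, tn = n_pos - tp, n_neg - fp
    n = n_pos + n_neg
    denom = (tp + fp) * (fn + tn) * n_pos * n_neg
    return n * (tp * tn - fp * fn) ** 2 / denom if denom else 0.0


def accuracy2(tp, fp, n_pos, n_neg):
    """Baseline Accuracy2 score: |tpr - fpr| (not the paper's variant)."""
    if n_pos == 0 or n_neg == 0:
        return 0.0
    return abs(tp / n_pos - fp / n_neg)


def top_k_terms(docs, labels, positive, score, k):
    """Rank the vocabulary by `score` and keep the k best terms."""
    tp, fp, n_pos, n_neg = term_counts(docs, labels, positive)
    vocab = set(tp) | set(fp)
    # Sort by descending score; break ties alphabetically for determinism.
    ranked = sorted(vocab, key=lambda t: (-score(tp[t], fp[t], n_pos, n_neg), t))
    return ranked[:k]


docs = [
    "cheap pills now",
    "cheap offer now",
    "meeting agenda now",
    "project meeting notes",
]
labels = ["spam", "spam", "ham", "ham"]
print(top_k_terms(docs, labels, "spam", chi_square, 2))
print(top_k_terms(docs, labels, "spam", accuracy2, 2))
```

Because a filter metric scores each term independently of any classifier, swapping one metric for another only means passing a different `score` function; the ranking and cut-off step stays the same.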