UDC 004.93
FEATURE SELECTION FOR TEXT CLASSIFICATION BASED ON CONSTRAINTS FOR TERM WEIGHTS
R. B. Sergienko*, M. Shan Ur Rehman, A. E. Khan, T. O. Gasanova, W. Minker
Ulm University, 43, Albert-Einstein-Allee, Ulm, 89081, Germany. *E-mail: roman.sergienko@uni-ulm.de
Text classification is an important data analysis problem with applications in many domains, including the aerospace industry. This paper considers several text classification problems, including opinion mining and topic categorization. Different text preprocessing techniques (TF-IDF, ConfWeight, and the Novel TW) and machine learning algorithms for classification (Bayes classifier, k-NN, SVM, and artificial neural network) are applied. The main goal of the presented investigation is to reduce the dimensionality of text classification problems by using feature selection based on constraints for term weights. Such feature selection significantly reduces both dimensionality and computational time. In addition, the use of constraints for term weights can increase classification effectiveness: we observed such an increase for three out of five problems. For the remaining two problems, we observed either no significant change or a decrease in classification effectiveness.
Keywords: topic categorization, text classification, opinion mining, feature selection, term weighting, constraint.
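The idea of selecting features through a constraint on term weights can be illustrated with a minimal sketch. It assumes a plain IDF weight and a hypothetical threshold `min_weight`; the function names and the toy corpus are illustrative only, and the weighting schemes and constraint values used in the paper (TF-IDF, ConfWeight, the Novel TW) are not reproduced here.

```python
import math
from collections import Counter


def idf_weights(docs):
    """Return an IDF weight per term: log(N / document frequency)."""
    n_docs = len(docs)
    # Count each term once per document it appears in.
    doc_freq = Counter(term for doc in docs for term in set(doc.split()))
    return {term: math.log(n_docs / df) for term, df in doc_freq.items()}


def select_terms(weights, min_weight):
    """Keep only terms satisfying the constraint weight >= min_weight.

    min_weight is a hypothetical threshold chosen for illustration.
    """
    return {term for term, w in weights.items() if w >= min_weight}


# Toy corpus (illustrative, not taken from the paper's data sets).
docs = [
    "the flight was delayed again",
    "the hotel room was clean",
    "the flight crew was friendly",
    "the room service was slow",
]

weights = idf_weights(docs)
selected = select_terms(weights, min_weight=0.5)
print(len(weights), "terms before selection,", len(selected), "after")
```

Terms that occur in every document receive a weight of zero and are dropped, so the feature space shrinks before any classifier is trained. Raising the threshold removes more terms, trading dimensionality against the information kept.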

Sergienko Roman Borisovich – Cand. Sc., senior researcher in the dialogue systems research group of the Institute of Communications Engineering, University of Ulm. E-mail: roman.sergienko@uni-ulm.de

Shan Ur Rehman Muhammad – Master’s Degree student, University of Ulm. E-mail: muhammad.shan@uni-ulm.de

Khan Arslan Ehsan – Master’s Degree student of communications technologies, University of Ulm. E-mail: arslan.khan@uni-ulm.de

Gasanova Tatyana Olegovna – Master’s Degree student of applied mathematics and informatics, University of Ulm. E-mail: tatiana.gasanova@uni-ulm.de

Minker Wolfgang – holder of two Dr. Sc. degrees, professor and associate director of the Institute of Communications Engineering, University of Ulm. E-mail: wolfgang.minker@uni-ulm.de