UDK 519.254 Doi: 10.31772/2712-8970-2021-22-1-18-31
Constraint-handling genetic algorithm for feature engineering in solving classification problems
Denisov M. A., Sopov E. A.
Reshetnev Siberian State University of Science and Technology; 31, Krasnoiarskii Rabochi Prospekt, Krasnoyarsk, 660037, Russian Federation
Feature engineering in machine learning is a promising but still insufficiently studied direction. Creating a new feature space from the original set makes it possible to increase the accuracy of the machine learning algorithm chosen to solve complex data mining problems. Some existing selection methods are capable of simultaneously increasing accuracy and reducing the feature space; such reduction is an urgent task in big data problems. The paper considers a new machine learning approach for solving classification problems based on feature engineering methods. Informative features are designed using extraction and selection methods. Based on the initial data, new feature sets are created that include the original features and features obtained by principal component analysis (PCA). An effective subset of informative features is selected using a genetic algorithm. To avoid overfitting and the creation of trivial classifiers, constraints are imposed on the fitness function of the genetic algorithm, requiring a certain number of features from the original sample as well as a certain number of features obtained by PCA. A comparative analysis of the efficiency of the following classification algorithms is carried out: k-nearest neighbors, support vector machine, and random forest. Experiments are carried out on applied binary classification problems from the UCI Machine Learning Repository, with the macro F1-score chosen as the efficiency criterion. The results of the numerical experiments show that the proposed approach outperforms both the solutions obtained using the original data set and random feature selection (the lower bound for the results). Moreover, the accuracy improvement is obtained for all types of problems, including data sets that have more features than instances. All results are shown to be statistically significant.
Keywords: feature selection, feature construction, genetic algorithm, constrained optimization
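The following minimal Python sketch (NumPy and scikit-learn, as in [33]) illustrates the pipeline the abstract describes: the original features are extended with PCA components, and a binary genetic algorithm selects a feature subset while the feature-count constraints are handled through a penalty on the macro-F1 fitness. All operator choices, parameter values, and constraint bounds (min_orig, min_pca, the penalty weight) are illustrative assumptions, not the authors' exact configuration.

import numpy as np
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

def extend_features(X, n_components=5):
    """Append PCA-derived features to the original feature matrix."""
    pca_features = PCA(n_components=n_components).fit_transform(X)
    return np.hstack([X, pca_features]), X.shape[1]

def fitness(mask, X_ext, y, n_orig, min_orig=2, min_pca=1, penalty=1.0):
    """Macro F1 of a classifier on the selected subset, with a static
    penalty when the original/PCA feature counts violate the bounds."""
    if mask.sum() == 0:
        return -np.inf  # reject the trivial empty subset
    score = cross_val_score(RandomForestClassifier(),
                            X_ext[:, mask.astype(bool)], y,
                            scoring="f1_macro", cv=5).mean()
    violation = (max(0, min_orig - int(mask[:n_orig].sum()))
                 + max(0, min_pca - int(mask[n_orig:].sum())))
    return score - penalty * violation

def genetic_selection(X_ext, y, n_orig, pop_size=30, generations=40,
                      p_mut=0.05, seed=0):
    """Binary GA: tournament selection, uniform crossover, bit-flip mutation."""
    rng = np.random.default_rng(seed)
    n = X_ext.shape[1]
    pop = rng.integers(0, 2, size=(pop_size, n))
    for _ in range(generations):
        fit = np.array([fitness(ind, X_ext, y, n_orig) for ind in pop])
        new_pop = [pop[fit.argmax()].copy()]           # elitism
        while len(new_pop) < pop_size:
            parents = []
            for _ in range(2):                         # tournaments of size 2
                i, j = rng.integers(0, pop_size, 2)
                parents.append(pop[i] if fit[i] >= fit[j] else pop[j])
            child = np.where(rng.random(n) < 0.5, *parents)        # crossover
            child ^= (rng.random(n) < p_mut).astype(child.dtype)   # mutation
            new_pop.append(child)
        pop = np.array(new_pop)
    fit = np.array([fitness(ind, X_ext, y, n_orig) for ind in pop])
    return pop[fit.argmax()].astype(bool)

# Usage: X_ext, n_orig = extend_features(X); mask = genetic_selection(X_ext, y, n_orig)

The static penalty used here is only one of the constraint-handling techniques surveyed in [29; 30]; adaptive penalties or repair operators could be substituted without changing the rest of the pipeline.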
References

1.  Guzella T. S., Caminhas W. M. A review of machine learning approaches to spam filtering. Expert Systems with Applications. 2009, Vol. 36, No. 7, P. 10206–10222. Doi: 10.1016/j.eswa.2009.02.037.

2.  Ballestar M. T., Grau-Carles P., Sainz J. Predicting customer quality in e-commerce social networks: a machine learning approach. Review of Managerial Science. 2019, Vol. 13, No. 3, P. 589–603. Doi: 10.1007/s11846-018-0316-x.

3.  Bahlmann C., Haasdonk B., Burkhardt H. Online handwriting recognition with support vector machines - a kernel approach. Proceedings Eighth International Workshop on Frontiers in Handwriting Recognition. 2002, P. 49–54. Doi: 10.1109/IWFHR.2002.1030883.

4.  Kononenko I. Machine learning for medical diagnosis: history, state of the art and perspective. Artificial Intelligence in Medicine. 2001, Vol. 23, No. 1, P. 89–109. Doi: 10.1016/S0933-3657(01)00077-X.

5.  Kouziokas G. N. Machine learning technique in time series prediction of gross domestic product. Proceedings of the 21st Pan-Hellenic Conference on Informatics. 2017, P. 1–2. Doi: 10.1145/3139367.3139443.

6.  John G. H., Kohavi R., Pfleger K. Irrelevant features and the subset selection problem. Machine Learning Proceedings. 1994, P. 121–129. Doi: 10.1016/B978-1-55860-335-6.50023-4.

7.  Kira K., Rendell L. A. A practical approach to feature selection. Machine Learning Proceedings. 1992, P. 249–256. Doi: 10.1016/B978-1-55860-247-2.50037-1.

8.  Rendell L., Seshu R. Learning hard concepts through constructive induction: Framework and rationale. Computational Intelligence. 1990, Vol. 6, No. 4, P. 247–270. Doi: 10.1111/j.1467-8640.1990.tb00298.x.

9.  Liu H., Motoda H. Feature extraction, construction and selection: A data mining perspective. Massachusetts : Kluwer Academic Publishers, 1998, 453 p.

10. Duboue P. The Art of Feature Engineering: Essentials for Machine Learning. Cambridge : Cambridge University Press, 2020, 270 p. Doi: 10.1017/9781108671682.

11. Zheng A., Casari A. Feature engineering for machine learning: principles and techniques for data scientists. Sebastopol : O'Reilly Media Inc., 2018, 193 p.

12. Li J., Cheng K., Morstatter F. et al. Feature selection: A data perspective. ACM Computing Surveys (CSUR). 2017, Vol. 50, No. 6, P. 1–45. Doi: 10.1145/3136625.

13. Park M. S., Na J. H., Choi J. Y. PCA-based feature extraction using class information. 2005 IEEE International Conference on Systems, Man and Cybernetics. 2005, Vol. 1, P. 341–345. Doi: 10.1109/ICSMC.2005.1571169.

14. Abdi H., Williams L. J. Principal component analysis. Wiley Interdisciplinary Reviews: Computational Statistics. 2010, Vol. 2, No. 4, P. 433–459. Doi: 10.1002/wics.101.

15. Markovitch S., Rosenstein D. Feature generation using general constructor functions. Machine Learning. 2002, Vol. 49, No. 1, P. 59–98. Doi: 10.1023/A:1014046307775.

16. Hirsh H., Japkowicz N. Bootstrapping training-data representations for inductive learning: A case study in molecular biology. AAAI-94 Proceedings. 1994, P. 639–644.

17. Sutton R. S., Matheus C. J. Learning polynomial functions by feature construction. Machine Learning Proceedings. 1991, P. 208–212. Doi: 10.1016/B978-1-55860-200-7.50045-3.

18. Zhao H., Sinha A. P., Ge W. Effects of feature construction on classification performance: An empirical study in bank failure prediction. Expert Systems with Applications. 2009, Vol. 36, No. 2, P. 2633–2644. Doi: 10.1016/j.eswa.2008.01.053.

19. Pagallo G., Haussler D. Boolean feature discovery in empirical learning. Machine Learning. 1990, Vol. 5, No. 1, P. 71–99. Doi: 10.1023/A:1022611825350.

20. Matheus C. J., Rendell L. A. Constructive induction on decision trees. IJCAI'89: Proceedings of the 11th International Joint Conference on Artificial Intelligence. 1989, Vol. 89, P. 645–650.

21. Krawiec K. Genetic programming-based construction of features for machine learning and knowledge discovery tasks. Genetic Programming and Evolvable Machines. 2002, Vol. 3, No. 4, P. 329–343. Doi: 10.1023/A:1020984725014.

22. Smith M. G., Bull L. Genetic programming with a genetic algorithm for feature construction and selection. Genetic Programming and Evolvable Machines. 2005, Vol. 6, No. 3, P. 265–281. Doi: 10.1007/s10710-005-2988-7.

23. Specia L., Srinivasan A., Sachindra J. et al. An investigation into feature construction to assist word sense disambiguation. Machine Learning. 2009, Vol. 76, No. 1, P. 109–136. Doi: 10.1007/s10994-009-5114-x.

24. Khalid S., Khalil T., Nasreen S. A survey of feature selection and feature extraction techniques in machine learning. 2014 Science and Information Conference. 2014, P. 372–378. Doi: 10.1109/SAI.2014.6918213.

25. Krivenko M. P. [Significance tests of feature selection for classification]. Informatics and Applications. 2016, Vol. 10, No. 3, P. 32–40. Doi: 10.14357/19922264160305. (In Russ.)

26. Miao J., Niu L. A survey on feature selection. Procedia Computer Science. 2016, Vol. 91, P. 919–926. Doi: 10.1016/j.procs.2016.07.111.

27. Chandrashekar G., Sahin F. A survey on feature selection methods. Computers & Electrical Engineering. 2014, Vol. 40, No. 1, P. 16–28. Doi: 10.1016/j.compeleceng.2013.11.024.

28. Xue B., Zhang M., Browne W. et al. A survey on evolutionary computation approaches to feature selection. IEEE Transactions on Evolutionary Computation. 2015, Vol. 20, No. 4, P. 606–626. Doi: 10.1109/TEVC.2015.2504420.

29. Coello C. Theoretical and numerical constraint-handling techniques used with evolutionary algorithms: a survey of the state of the art. Computer Methods in Applied Mechanics and Engineering. 2002, Vol. 191, No. 11–12, P. 1245–1287. Doi: 10.1016/S0045-7825(01)00323-1.

30. Barbosa H. J. C., Lemonge A. C. C. An adaptive penalty method for genetic algorithms in constrained optimization problems. Frontiers in Evolutionary Robotics. 2008. Doi: 10.5772/5446.

31. UCI Machine Learning Repository. Available at: https://archive.ics.uci.edu/ml/index.php (accessed 09.01.2021).

32. Opitz J., Burst S. Macro F1 and Macro F1. Preprint arXiv:1911.03347. Available at: https://arxiv.org/abs/1911.03347 (accessed 25.02.2021).

33. Pedregosa F., Varoquaux G., Gramfort A. et al. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research. 2011, Vol. 12, P. 2825–2830.

34. Dong G., Liao G., Liu H., Kuang G. A review of the autoencoder and its variants: A comparative perspective from target recognition in synthetic-aperture radar images. IEEE Geoscience and Remote Sensing Magazine. 2018, Vol. 6, No. 3, P. 44–68. Doi: 10.1109/MGRS.2018.2853555.


Denisov Maksim Andreevich – postgraduate; Reshetnev Siberian State University of Science and Technology. E-mail: max_denisov00@mail.ru.

Sopov Evgenii Aleksandrovich – PhD (CS), associate professor; Reshetnev Siberian State University of Science and Technology. E-mail: evgenysopov@gmail.com.

