UDC 519.254
DOI: 10.31772/2712-8970-2021-22-1-18-31
Constraint handling genetic algorithm for feature engineering in solving classification problems
Denisov M. A., Sopov E. A.
Reshetnev Siberian State University of Science and Technology;
31, Krasnoiarskii Rabochi Prospekt, Krasnoyarsk, 660037, Russian Federation
Feature engineering in machine learning is a promising but still insufficiently studied direction. Constructing a new feature space from the original set makes it possible to increase the accuracy of the machine learning algorithm chosen to solve complex data mining problems. Some existing selection methods are capable of simultaneously increasing accuracy and reducing the feature space. Such reduction is an especially urgent task for big data problems.
The paper considers a new machine learning approach for solving classification problems based on feature engineering methods. Informative features are designed using extraction and selection methods. From the initial data, new feature sets are created that combine the original features with features obtained by principal component analysis. An effective subset of informative features is then chosen with a genetic algorithm. To avoid overfitting and the creation of trivial classifiers, constraints are imposed on the fitness function of the genetic algorithm, requiring a certain number of features from the original sample as well as a certain number of features obtained by principal component analysis. A comparative analysis of the efficiency of the following classification algorithms is carried out: k-nearest neighbors, support vector machine, and random forest. The efficiency experiments are performed on applied binary classification problems from the UCI Machine Learning Repository. The macro F1-score was chosen as the efficiency criterion.
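To make the efficiency criterion concrete, the following sketch computes the macro F1-score by hand (per-class F1 averaged with equal class weights, which makes the metric robust to class imbalance) and checks it against scikit-learn. The label vectors are illustrative data, not taken from the paper's experiments.

```python
import numpy as np
from sklearn.metrics import f1_score

# Toy binary labels for illustration only.
y_true = np.array([0, 0, 0, 1, 1, 1, 1, 1])
y_pred = np.array([0, 1, 0, 1, 1, 0, 1, 1])

# Macro F1: compute the F1-score of each class separately,
# then average with equal weight per class.
f1s = []
for c in (0, 1):
    tp = np.sum((y_pred == c) & (y_true == c))
    fp = np.sum((y_pred == c) & (y_true != c))
    fn = np.sum((y_pred != c) & (y_true == c))
    prec = tp / (tp + fp)
    rec = tp / (tp + fn)
    f1s.append(2 * prec * rec / (prec + rec))
macro_f1 = float(np.mean(f1s))

# Matches scikit-learn's implementation on the same labels.
assert np.isclose(macro_f1, f1_score(y_true, y_pred, average="macro"))
```

Unlike the micro-averaged F1, the macro variant does not let the majority class dominate the score, which matters for the imbalanced UCI data sets used in the experiments.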
The results of numerical experiments show that the proposed approach outperforms both the solutions obtained on the original data set and random feature selection (the lower bound for the results). Moreover, the accuracy improvement is obtained for all types of problems, including data sets that have more features than instances. All results are shown to be statistically significant.
Keywords: feature selection, feature construction, genetic algorithm, constrained optimization
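The pipeline the abstract describes can be sketched end to end: build a candidate pool of original plus PCA-derived features, then run a genetic algorithm whose fitness is the cross-validated macro F1 of a classifier on the selected subset, with constraints requiring minimum counts of each feature type. This is a minimal illustration under assumptions, not the authors' implementation: the data set, the thresholds `MIN_ORIG`/`MIN_PCA`, the death-penalty constraint handling, and the GA settings are all chosen for the example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)

# Candidate pool: original features plus PCA-derived features.
X, y = make_classification(n_samples=200, n_features=10,
                           n_informative=5, random_state=0)
X_pca = PCA(n_components=5, random_state=0).fit_transform(X)
pool = np.hstack([X, X_pca])        # 10 original + 5 PCA features
n_orig = X.shape[1]

MIN_ORIG, MIN_PCA = 3, 1            # assumed constraint thresholds

def feasible(mask):
    """Constraint: keep at least MIN_ORIG original and MIN_PCA PCA features."""
    return mask[:n_orig].sum() >= MIN_ORIG and mask[n_orig:].sum() >= MIN_PCA

def fitness(mask):
    """Cross-validated macro F1 of k-NN on the selected subset;
    infeasible chromosomes receive a death penalty."""
    if not feasible(mask):
        return -1.0
    clf = KNeighborsClassifier(n_neighbors=5)
    return cross_val_score(clf, pool[:, mask.astype(bool)], y,
                           cv=3, scoring="f1_macro").mean()

def select(pop, fit):
    """Binary tournament selection."""
    i, j = rng.integers(0, len(pop), 2)
    return pop[i] if fit[i] >= fit[j] else pop[j]

def ga(pop_size=20, generations=10, p_mut=0.1):
    pop = rng.integers(0, 2, size=(pop_size, pool.shape[1]))
    best_mask, best_fit = None, -np.inf
    for _ in range(generations):
        fit = np.array([fitness(m) for m in pop])
        if fit.max() > best_fit:                      # keep best-ever solution
            best_fit, best_mask = fit.max(), pop[fit.argmax()].copy()
        children = []
        for _ in range(pop_size):
            a, b = select(pop, fit), select(pop, fit)
            cross = rng.random(pool.shape[1]) < 0.5   # uniform crossover
            child = np.where(cross, a, b)
            flip = rng.random(pool.shape[1]) < p_mut  # bit-flip mutation
            children.append(np.where(flip, 1 - child, child))
        pop = np.array(children)
    return best_mask, best_fit

mask, score = ga()
```

The death penalty is only one of the constraint-handling options available; adaptive penalties, which the paper also cites, would instead subtract a violation-dependent term from the fitness so that near-feasible chromosomes still propagate useful genetic material.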