Improving Thesis Title Classification Accuracy Using Ensemble Classifier and Modified Chi-Square Feature Selection Method

Authors

  • Ritzkal, Universitas Ibn Khaldun
  • Wahyu Tisno Atmojo, Sistem Informasi, Universitas Pradita
  • Panji Novantara, Ilmu Komputer, Universitas Kuningan
  • Sabir Rosidin, Doctoral Program of Information Systems
  • Ahmad Dedi Jubaedi, Universitas Serang Raya
  • Enggar Novianto, Universitas Sebelas Maret

Keywords:

Thesis Title Classification, Modified Chi-Square, Feature Selection, Ensemble Learning, Ensemble Classifier, Machine Learning

Abstract

Classifying academic documents, particularly thesis titles, is challenging because of high dimensionality, sparsity, and topic heterogeneity. Conventional feature selection techniques, such as the standard Chi-Square, often fail to capture discriminative features effectively. This research aims to improve classification accuracy by proposing a Modified Chi-Square feature selection method that integrates term frequency and class distribution information. The selected features are then classified using tree-based ensemble algorithms, including Random Forest, Gradient Boosting, and XGBoost. Experiments were conducted on a labeled dataset of thesis titles represented as TF-IDF vectors. Model performance was assessed with accuracy, precision, recall, F1-score, and AUC. The combination of Modified Chi-Square and XGBoost outperformed the other models, achieving the highest accuracy (93.8%) and an AUC of 0.94. These findings indicate that combining advanced feature selection with ensemble learning can substantially improve academic text classification, with implications for the development of intelligent digital repositories and recommendation systems.
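
To make the feature selection idea concrete, the following is a minimal sketch of the standard Chi-Square score for a term and a class, alongside one possible frequency-aware variant. The abstract does not give the authors' exact formula, so the weighting used here (the share of a term's occurrences that fall in the target class) is an assumption for illustration only, and the function names, toy titles, and labels are hypothetical.

```python
def chi_square(a, b, c, d):
    """Standard Chi-Square for a term/class 2x2 contingency table.

    a: documents in the class that contain the term
    b: documents outside the class that contain the term
    c: documents in the class that lack the term
    d: documents outside the class that lack the term
    """
    n = a + b + c + d
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    if denom == 0:
        return 0.0
    return n * (a * d - c * b) ** 2 / denom


def score_terms(docs, labels, target):
    """Return {term: (standard_score, weighted_score)} for one target class.

    docs is a list of token lists and labels the matching class labels.
    The weighted score is an ASSUMED variant, not the paper's formula.
    """
    vocab = {term for doc in docs for term in doc}
    pairs = list(zip(docs, labels))
    n_pos = sum(1 for y in labels if y == target)
    n_neg = len(labels) - n_pos
    scores = {}
    for term in vocab:
        a = sum(1 for doc, y in pairs if y == target and term in doc)
        b = sum(1 for doc, y in pairs if y != target and term in doc)
        base = chi_square(a, b, n_pos - a, n_neg - b)
        # Assumed weight: fraction of the term's occurrences inside the class.
        tf_in = sum(doc.count(term) for doc, y in pairs if y == target)
        tf_all = sum(doc.count(term) for doc in docs)
        weight = tf_in / tf_all if tf_all else 0.0
        scores[term] = (base, base * weight)
    return scores


# Hypothetical toy thesis titles, already tokenized.
docs = [
    ["neural", "network", "image"],
    ["image", "classification", "network"],
    ["accounting", "information", "system"],
    ["system", "audit", "information"],
]
labels = ["ai", "ai", "is", "is"]
scores = score_terms(docs, labels, target="ai")
```

In this toy data the standard score gives "network" and "system" the same value for the "ai" class, because Chi-Square measures the strength of association regardless of its direction. The assumed weighting keeps the score for "network" and drops "system" to zero, which is the kind of effect a method combining term frequency with class distribution could be expected to have.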

Published

2025-08-17

How to Cite

Improving Thesis Title Classification Accuracy Using Ensemble Classifier and Modified Chi-Square Feature Selection Method. (2025). Indonesian Applied Research Computing and Informatics, 1(1), 37-47. https://jurnal.tdinus.com/index.php/iarci/article/view/52