Data-Driven Insights Into Underdeveloped Regencies: SHAP-Based Explainable Artificial Intelligence Approach

Siskarossa Ika Oktora; Dariani Matualage; Khairil Anwar Notodiputro; Bagus Sartono

doi:10.29099/ijair.v9i1.1399



⁽¹⁾* Siskarossa Ika Oktora

(IPB University; Politeknik Statistika STIS, Indonesia)
⁽²⁾ Dariani Matualage

(IPB University; Universitas Papua, Indonesia)
⁽³⁾ Khairil Anwar Notodiputro

(IPB University, Indonesia)
⁽⁴⁾ Bagus Sartono

(IPB University, Indonesia)
*corresponding author

Abstract

Classification of high-dimensional data is challenging, particularly when the data contain complex non-linear patterns that traditional methods such as logistic regression fail to capture. This limitation often shows up as relatively low model accuracy. Machine learning classifiers such as Random Forest and Support Vector Machine (SVM) are one way to address it. These models generally achieve higher accuracy than logistic regression, but their black-box nature makes their classification decisions difficult to explain. As machine learning models continue to advance, interpretability has become a crucial concern, especially in data-driven decision-making. Post-hoc explainable artificial intelligence (XAI) techniques offer a viable way to improve model transparency. This study applies Shapley Additive Explanations (SHAP) to machine learning models to gain insight into the underdevelopment status of regencies in Indonesia. The results indicate that SVM outperforms both logistic regression and Random Forest. SHAP values estimated from the SVM using various permuted variable subsets are stable. Clustering analysis identifies five as the optimal number of clusters of underdeveloped regencies. Based on average SHAP values, alleviation strategies should focus on social factors in Cluster 1, infrastructure in Cluster 2, accessibility in Cluster 3, a combination of infrastructure, accessibility, education, and healthcare in Cluster 4, and accessibility and economic conditions in Cluster 5.
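
The idea of estimating SHAP values from permuted variable orderings can be illustrated with a minimal sketch. It is not the authors' implementation: `toy_model` is a hypothetical scoring function standing in for a fitted SVM decision function, and the all-zero baseline, the number of permutations, and the seed are assumptions chosen only for illustration. Each sampled ordering switches features from the baseline value to the instance value one at a time and credits the resulting change in the model output to the feature that was switched.

```python
import random


def toy_model(x):
    """Hypothetical non-linear scorer; stands in for a fitted SVM decision function."""
    return 2.0 * x[0] + x[1] * x[2] - 0.5 * x[3]


def permutation_shapley(model, x, baseline, n_permutations=200, seed=0):
    """Estimate Shapley values for one instance by sampling feature orderings."""
    rng = random.Random(seed)
    n = len(x)
    phi = [0.0] * n
    for _ in range(n_permutations):
        order = list(range(n))
        rng.shuffle(order)
        current = list(baseline)      # start with every feature at its baseline value
        previous = model(current)
        for j in order:
            current[j] = x[j]         # switch feature j to the instance value
            updated = model(current)
            phi[j] += updated - previous   # marginal contribution of feature j
            previous = updated
    return [total / n_permutations for total in phi]


x = [1.0, 2.0, 3.0, 4.0]             # illustrative instance (assumed values)
baseline = [0.0, 0.0, 0.0, 0.0]      # illustrative reference point (assumed)
phi = permutation_shapley(toy_model, x, baseline)

# Efficiency property: the contributions sum to model(x) - model(baseline).
total = sum(phi)
gap = toy_model(x) - toy_model(baseline)
```

In this toy setting the purely additive features (the first and fourth) receive the same contribution in every ordering, while the two interacting features share the credit for their product, and their estimates vary with the sampled orderings. Re-running the estimator with different seeds or permutation counts and comparing the results is one simple way to check the kind of stability the abstract reports.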

Keywords

Shapley Additive Explanations; Support Vector Machine; Random Forest; Hierarchical Cluster; PERMANOVA

DOI

https://doi.org/10.29099/ijair.v9i1.1399



This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.

________________________________________________________

The International Journal of Artificial Intelligence Research

Organized by: Prodi Teknik Informatika Fakultas Teknologi Bisnis dan Sains
Published by: Universitas Dharma Wacana
Jl. Kenanga No. 03 Mulyojati 16C Metro Barat Kota Metro Lampung

Email: jurnal.ijair@gmail.com
