Prediction of Scholarship Recipients Using Hybrid Data Mining Method with Combination of K-Means and C4.5 Algorithms

ABSTRACT

A student also has to pay various fees to the college where the student is educated. For students who come from well-to-do families or whose parents have a large income, it is easy to cover the cost of education, while for students who come from underprivileged families or whose parents have a small income, it is difficult to cover the required tuition fees [10], [11].
Students who have difficulty covering education costs have several options, one of which is a scholarship. Many government and private institutions provide scholarships, but the number is limited, so not every student can receive one. Educational institutions therefore always make a selection among the students who apply for the scholarships on offer.
The method of dredging, storing, or extracting valuable information from a broad data set is known as data mining. The data mining process often uses statistical and mathematical methods together with artificial intelligence technology. Knowledge discovery in databases (KDD), knowledge extraction, data/pattern analysis, data archeology, data dredging, information harvesting, business intelligence, and other terms are also used to describe data mining. Educational institutions apply various approaches in selecting the students who will receive scholarships. Some do it manually, and some do it by computer. Many algorithms are used in the computerized approach; the two most widely used by researchers are the K-Means algorithm and the C4.5 algorithm. Combining the two algorithms is reported to produce more accurate predictions than using only one of them [12], [13].
The K-Means algorithm is one of the algorithms used in data mining to group data, which is also known as clustering. The C4.5 algorithm is one of several algorithms used in data mining to forecast or predict data. Once the clustering results of the K-Means algorithm are obtained, they are used as input for prediction with the C4.5 algorithm [14]-[16]. The use of two types of algorithms in a single data analysis is known as hybrid data mining. In this research, hybrid data mining was carried out by combining the K-Means algorithm with the C4.5 algorithm.
The objective of this research is to predict which computer science students will receive scholarships using the K-Means method and the C4.5 algorithm. The research also aims to learn how the K-Means algorithm clusters the data of potential scholarship recipients and how the C4.5 algorithm makes predictions from the clustered data [17], [18]. One of the large and prominent private universities in Indonesia, especially in the province of West Sumatra, is the Universitas Putra Indonesia YPTK Padang. This university has a faculty of computer science (FILKOM). In this study, data on students enrolled at the Faculty of Computer Science of Universitas Putra Indonesia YPTK Padang in 2019 were used as the source data.
Many scholarships are available at the Faculty of Computer Science of Universitas Putra Indonesia YPTK Padang, including awards for 1st- and 2nd-rank winners, Bidik Misi scholarships, BBM scholarships, PPA scholarships, and scholarships from various banks in Indonesia. All available scholarships are distributed evenly to the students who are entitled to receive them, so eligible students must be selected. So that future students can find out which factors determine whether a student will receive a scholarship, a prediction is needed using data mining techniques, namely a combination of the K-Means and C4.5 algorithms [19].
Previous research relevant to this study was conducted in 2015 by Nurul Rohmawati W et al. from Singaperbangsa Karawang University, West Java [20]. That study clustered each of several dataset formats (partial codification, whole codification, and original data) with the k-means algorithm and measured the accuracy of the clustering by calculating the purity measure of the resulting clusters. The greater the purity value (the closer to 1), the better the quality of the clusters produced by an algorithm. The weakness of that study is that the purity value of the k-means clustering results on the partially codified dataset is only 61.11%.
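The purity measure mentioned above can be illustrated with a short sketch. It assumes the usual definition of purity (the share of records that belong to the majority class of their own cluster); the function name `purity` and the sample labels are hypothetical and are not taken from the cited study.

```python
from collections import Counter


def purity(cluster_ids, true_labels):
    """Share of records that fall in the majority class of their cluster."""
    if len(cluster_ids) != len(true_labels):
        raise ValueError("cluster_ids and true_labels must have the same length")
    if not cluster_ids:
        raise ValueError("at least one record is required")

    # Count the class labels that occur inside each cluster.
    counts_by_cluster = {}
    for cluster_id, label in zip(cluster_ids, true_labels):
        counts_by_cluster.setdefault(cluster_id, Counter())[label] += 1

    # Sum the size of the largest class in every cluster.
    majority_total = sum(max(c.values()) for c in counts_by_cluster.values())
    return majority_total / len(cluster_ids)


# Hypothetical example: six records in two clusters.
clusters = [0, 0, 0, 1, 1, 1]
labels = ["YES", "YES", "NO", "NO", "NO", "YES"]
print(purity(clusters, labels))
```

In this example each cluster contains two records of its majority class out of three, so four of the six records count toward purity.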
The second study was conducted in 2018 by Iin Parlina from AMIK Tunas Bangsa Pematangsiantar, North Sumatra [21]. It produced clusters by applying several criteria from the SDP program list using the K-Means algorithm. According to the classification results, SDP participants who pass proceed to an advanced assessment center, a second group almost passes, and those who do not pass must improve their disciplinary record for one year.
The drawback of this study is that the grouping yields only three groups, namely passed, almost passed, and did not pass.

A. Research Framework
• Problem formulation is the process of determining the main problems that occur in the field and for which this research must find a solution.
• Literature study is the process of finding and analyzing previous research and library books relevant to solving the predetermined problem.
• Data collection is the process of collecting data in the field that will be used in problem-solving. In this study, the data came from the Faculty of Computer Science, Universitas Putra Indonesia YPTK Padang, and were obtained from the Vice Dean III (WD III). The test data collected comprise 100 rows, one for each of 100 students. The authors use the K-Means algorithm to group (cluster) the data of these 100 students. The C4.5 method then makes predictions from the same data after they have been grouped. The data collected consist of 6 data attributes, as shown in Table 1.
• Data mining is a system of processes for extracting additional value from a set of data and information that cannot be processed manually. The term "mining" refers to the process of extracting a few useful products from a large amount of raw material. Data mining has a long history in fields like artificial intelligence, machine learning, statistics, and database management. It can be described as the process of extracting hidden patterns from large amounts of data, and it is becoming increasingly relevant for translating such data into information. It is commonly used in advertising, surveillance, fraud detection, and scientific discovery, among other things.
Various algorithms are used for data mining, including the K-Means algorithm, the C4.5 algorithm, the Support Vector Machine (SVM) algorithm, the Apriori algorithm, the expectation-maximization algorithm, and others. In this study, two algorithms were combined, namely the K-Means algorithm and the C4.5 algorithm.

• Data Processing Using the K-Means Method
The data that the authors collected in the field are processed using the K-Means method. The steps of the K-Means method are as follows:
1) Determine the number of data clusters to be created; this number is called the value k.
2) Determine the mean (centroid value) randomly for each predetermined cluster.
3) Determine the nearest cluster center for each data row by calculating the distance between the row and each centroid using the formula [22]:
$$d(x, c) = \sqrt{\sum_{i=1}^{n} (x_i - c_i)^2}$$
Information: $d(x, c)$ = distance between a data row and a centroid, $x = x_1, x_2, x_3, \ldots, x_n$ = data row, $c = c_1, c_2, c_3, \ldots, c_n$ = centroid.
4) Assign each data row to the closest cluster by comparing the distance values obtained in the previous step, then update the center value of each cluster using the formula:
$$\text{Cluster Center} = \frac{\sum_{i=1}^{n} a_i}{n}$$
Information: Cluster Center = the cluster center value, $a_i$ = value of each data row in the cluster, $n$ = number of data rows in the cluster.
5) Repeat steps 3 and 4 until no data row moves from one cluster to another.
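The K-Means steps above can be sketched in plain Python as follows. This is a minimal illustration, not the RapidMiner implementation used in the study: it assumes Euclidean distance, takes the initial centroids as an argument instead of drawing them randomly so that the result is reproducible, and uses hypothetical two-attribute sample rows.

```python
import math


def euclidean(row, centroid):
    """Distance between one data row and one centroid (step 3)."""
    return math.sqrt(sum((x - c) ** 2 for x, c in zip(row, centroid)))


def k_means(rows, initial_centroids, max_iter=100):
    """Return (assignment, centroids) once no row changes cluster."""
    centroids = [list(c) for c in initial_centroids]
    assignment = None
    for _ in range(max_iter):
        # Steps 3-4: assign every row to its nearest centroid.
        new_assignment = [
            min(range(len(centroids)), key=lambda j: euclidean(row, centroids[j]))
            for row in rows
        ]
        # Step 5: stop when no row moved to another cluster.
        if new_assignment == assignment:
            break
        assignment = new_assignment
        # Step 4: move each centroid to the mean of its member rows.
        for j in range(len(centroids)):
            members = [row for row, a in zip(rows, assignment) if a == j]
            if members:  # keep the old centroid if a cluster is empty
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return assignment, centroids


# Hypothetical rows with two coded attributes each (not the study's data).
rows = [(1, 1), (1, 2), (2, 1), (4, 4), (4, 3), (3, 4)]
assignment, centroids = k_means(rows, initial_centroids=[(1, 1), (4, 4)])
print(assignment)
print(centroids)
```

With k = 2, the first three sample rows end up in cluster 0 and the last three in cluster 1, and each centroid settles at the mean of its three members.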
• Data Processing Using C4.5 Method
The results of data processing using the K-Means method are then processed using the C4.5 method. The steps of the C4.5 method are as follows:
1) Determine the data attribute that will be used as the root node in the decision tree and calculate the number of YES and NO values over the data rows.
2) After determining the root of the decision tree, determine the branches from the root for each value by calculating the Gain value. The gain calculation formula is as follows [23]:
$$Gain(S, A) = Entropy(S) - \sum_{i=1}^{n} \frac{|S_i|}{|S|} \, Entropy(S_i)$$
Information: $Gain(S, A)$ = total gain value for attribute $A$, $Entropy(S)$ = total entropy value, $Entropy(S_i)$ = entropy value for each partition of the attribute, $|S_i|$ = number of cases in partition $i$, $|S|$ = total number of cases, $n$ = number of partitions of attribute $A$.
The formula for calculating the entropy value is:
$$Entropy(S) = \sum_{i=1}^{n} -p_i \log_2 p_i$$
Information: $Entropy(S)$ = total entropy value, $p_i$ = proportion of $S_i$ to $S$.
3) Divide the cases for each existing branch.
4) Repeat steps 2 and 3 for each branch until all cases in the branch have the same class.
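The entropy and gain calculations above can be made concrete with a short sketch. It follows the two formulas as described in the text; note that C4.5 as usually described also normalizes the gain by the split information (gain ratio), which is left out here. The attribute names and the four sample rows are hypothetical.

```python
import math
from collections import Counter


def entropy(labels):
    """Entropy(S): sum of -p_i * log2(p_i) over the classes in labels."""
    total = len(labels)
    return sum(-(n / total) * math.log2(n / total) for n in Counter(labels).values())


def gain(rows, attribute, target):
    """Gain(S, A): entropy of S minus the weighted entropy of each partition S_i."""
    base = entropy([row[target] for row in rows])
    # Partition the target labels by the value of the chosen attribute.
    partitions = {}
    for row in rows:
        partitions.setdefault(row[attribute], []).append(row[target])
    remainder = sum(
        len(part) / len(rows) * entropy(part) for part in partitions.values()
    )
    return base - remainder


# Hypothetical rows (not the study's data).
rows = [
    {"PO": "LOW", "GPA": "HIGH", "SCHOLARSHIP": "YES"},
    {"PO": "LOW", "GPA": "HIGH", "SCHOLARSHIP": "YES"},
    {"PO": "LOW", "GPA": "LOW", "SCHOLARSHIP": "NO"},
    {"PO": "HIGH", "GPA": "HIGH", "SCHOLARSHIP": "NO"},
]
for attribute in ("PO", "GPA"):
    print(attribute, round(gain(rows, attribute, "SCHOLARSHIP"), 4))
```

The attribute with the largest gain would be chosen for the node, after which the rows are divided by its values and the calculation is repeated on each branch.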
After the data are processed using the K-Means method and the C4.5 method, the results are analyzed and conclusions are drawn. The K-Means results show how many groups were formed and which students belong to each group. The grouped data are then analyzed with the C4.5 method to determine which students are predicted to receive scholarships and which are not. Data processing uses RapidMiner software version 9.7.002. The results of the analysis and the conclusions can be applied in the field, where they help institutions, especially educational institutions, predict which of their students will and will not receive scholarships.

A. Results of Clustering Data Processing Using the K-Means Method
The initial data that have been collected, consisting of 6 attributes and 100 data records, can be seen in Table 2.
After the test data in Table 2 are processed with the K-Means method using the RapidMiner application, the data clustering is obtained. Data clustering was carried out on the attributes father's income (IDR/month), mother's income (IDR/month), and cumulative grade point average (GPA). To simplify the calculations, the father's income and the mother's income are combined into one attribute, parental income (PO), by adding the two incomes and dividing the sum by two to obtain the average. After the two income attributes are combined into parental income, the data are converted into the values 1, 2, 3, and 4 using Table 3.
After the converted data are obtained according to Table 3, the next step is to cluster the data using the K-Means algorithm. The steps of the K-Means algorithm are as follows:
Step 1. Determine the number of data clusters to be created; this is referred to as the value k. For the data in Table 4, this study uses 2 data clusters, namely cluster 0 and cluster 1.
Step 2. Determine the centroid value randomly for each predetermined cluster. The centroid value for each cluster is shown in Table 5.
Step 3. Determine the closest cluster center for each row of data by calculating its distance to each centroid value, using the distance formula given in the research framework.
Step 4. Determine the closest cluster for each row of data by comparing the distance values obtained in the previous step, then update the center value of each cluster using the cluster center formula given in the research framework.
Step 5. Repeat steps 3 and 4 until no row of data moves from one cluster to another. Table 6 shows the results of grouping the data using the K-Means algorithm.
Producing the data shown in Table 6 requires a data processing block design in the RapidMiner application. Fig. 2 illustrates the block architecture.

The data that the researchers collected comprise 100 rows, representing 100 students who did or did not receive scholarships in the previous year. Data that have been grouped into clusters are given names that describe the type of cluster. The steps of the C4.5 algorithm are as follows:
Step 1. Determine the data attribute that will be used as the root node in the decision tree and calculate the number of YES and NO values over the data rows. Table 7 shows the data after they have been named YES and NO and the attribute for receiving a scholarship has been added.
Step 2. After determining the root of the decision tree, determine the branches from the root for each value by calculating the Gain value, using the gain formula given in the research framework.
Step 3. Divide the cases for each existing branch.
Step 4. For each branch, repeat steps 2-3 until all cases on the branch have the same class.
Because the C4.5 calculation is repeated over many iterations, data mining software is needed to speed up and simplify the calculations. The application used is RapidMiner version 9.7.002. Table 8 shows the results of the prediction calculations using the C4.5 algorithm. The prediction results can also be presented in graphical form (plot view) so that they are easy to read, as shown in Fig. 4.
Processing data in the RapidMiner application requires a block design that describes the sequence of data processing. The block design used to produce the data in Table 7, Table 8, and Fig. 4 can be seen in Fig. 5. The shape of the decision tree obtained from the prediction results can be seen in Fig. 6.
Fig. 6. Decision Tree Prediction Method

C. Discussion
Based on the results of clustering the data in Table 2 with the K-Means method in the RapidMiner application, the data of 100 students are divided into two clusters, namely cluster 0 and cluster 1. Cluster 0 is defined as scholarship recipients and cluster 1 as non-recipients. The output shows that 32 students did not receive scholarships and 68 students did.
Figure 2 is a design view, that is, a block design of the clustering process for each attribute. Each data attribute is clustered using the K-Means method in the RapidMiner application. From the far left, the Read Excel block inputs the Excel data to be processed, the Clustering block groups (clusters) the data, and the Performance block evaluates the performance of the data processing.
Figure 3 is a visualization view, which displays the data grouped (clustered) with the K-Means method as colored dots. The blue dots represent the cluster 1 data group and the green dots represent the cluster 0 data group. The number of dots on the graph corresponds to the number of students in the collected data.
Based on the results of prediction with the C4.5 method in the RapidMiner application, according to the data in Table 7, of the 100 student records processed, 9 received a YES value and 91 received a NO value.
Figure 4 is a plot view in which each point represents the scholarship prediction for one student. The blue points are students who are predicted to receive a scholarship and the green points are students who are predicted NOT to receive a scholarship. In the plot, it can be seen that 9 students are predicted as YES (scholarship recipients) and 91 students are predicted as NOT scholarship recipients.
Figure 5 shows the block design of the C4.5 method in the RapidMiner application. The application takes the data input in the first block; the input data are then split between two blocks, namely the Decision Tree block and the Apply Model block, whose outputs are combined in the final block, the Performance block, which determines the accuracy of the predictions.
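What the Performance block reports can be approximated by comparing predicted labels with actual labels. The sketch below assumes that accuracy means the share of records predicted correctly; the function name and the sample labels are hypothetical and do not come from the study's RapidMiner process.

```python
def accuracy(actual, predicted):
    """Share of records whose predicted label equals the actual label."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    if not actual:
        raise ValueError("at least one record is required")
    correct = sum(1 for a, p in zip(actual, predicted) if a == p)
    return correct / len(actual)


# Hypothetical labels for five students (not the study's data).
actual = ["YES", "NO", "NO", "YES", "NO"]
predicted = ["YES", "NO", "YES", "YES", "NO"]
print(accuracy(actual, predicted))
```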
Figure 6 shows the decision tree. The prediction flow starts by examining parental income (PO), which may be low, very low, or high. If the parents' income is low, the next check is the status of the student's parents, that is, whether they are still living. If the parents are still living, the GPA is checked to see whether it is high or low. If it is high, the student will receive a scholarship.

Fig. 1. Research Framework
a. Research Framework. Research must follow a structured and systematic research framework. The research framework can be seen in Fig. 1.
b. Description of the Research Framework.

Fig. 2. Block Architecture of K-Means Algorithm in Rapid Miner
To show the clustering results visually, so that the kind of data grouping that has been generated can be seen, a visualization view is needed, as in Fig. 3.

Fig. 5. Block Design C4.5 Algorithm in Rapid Miner
The outcomes of the Table 8 predictions may take the form of statistics or data conclusions. The data are presented below as a summary of the decision tree in text view.
Tree
PO = LOW
| STATUS = PRESENT
| | GPA = LOW: YES {YES=2, NO=0}
| | GPA = HIGHT: YES {YES=41, NO=15}
| STATUS = NO: YES {YES=11, NO=0}
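The text view of the tree can be read as a set of nested rules. The sketch below encodes only the branches listed in that text view (the labels, including the spelling HIGHT, are copied as they appear) and looks up the leaf for one record. The nested-dictionary layout and the function names are illustrative assumptions, not a RapidMiner export format.

```python
# Nested encoding of the branches listed in the tree text view.
TREE = {
    "attribute": "PO",
    "branches": {
        "LOW": {
            "attribute": "STATUS",
            "branches": {
                "PRESENT": {
                    "attribute": "GPA",
                    "branches": {
                        "LOW": {"label": "YES", "counts": {"YES": 2, "NO": 0}},
                        "HIGHT": {"label": "YES", "counts": {"YES": 41, "NO": 15}},
                    },
                },
                "NO": {"label": "YES", "counts": {"YES": 11, "NO": 0}},
            },
        },
    },
}


def find_leaf(tree, record):
    """Follow the record's attribute values down to a leaf, or return None."""
    node = tree
    while "label" not in node:
        value = record.get(node["attribute"])
        if value not in node["branches"]:
            return None  # this branch is not listed in the text view
        node = node["branches"][value]
    return node


def leaf_share(leaf):
    """Share of the leaf's training cases that carry the leaf's own label."""
    return leaf["counts"][leaf["label"]] / sum(leaf["counts"].values())


student = {"PO": "LOW", "STATUS": "PRESENT", "GPA": "HIGHT"}
leaf = find_leaf(TREE, student)
print(leaf["label"], round(leaf_share(leaf), 3))
```

The leaf counts show how mixed a leaf is: a leaf with {YES=41, NO=15} predicts YES, although 15 of its 56 cases are NO.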

Table 1
Mardison et.al (Prediction of Scholarship Recipients Using Hybrid Data Mining Method with Combination of K-Means and C4.5 Algorithms)

Table 2. Preliminary data

Table 3. Initial Data Conversion Guidelines
The results of converting the data according to Table 3 are shown in Table 4.

Table 4. Results of Data Conversion

Table 7. Naming Data According to Clusters

Table 8. Prediction Result Data