Hybrid Data Mining with the Combination of K-Means Algorithm and C4.5 to Predict Student Achievement


Academic achievement is the main accomplishment that students are expected to attain on campus, comparable to being the top of the class. Every parent who finances a child's studies at a tertiary institution expects that child to attain the highest achievement, such as ranking first in class every semester. This achievement is also an indicator of a student's success in pursuing education [10]. Becoming a top achiever is by no means easy; it demands considerable sacrifice of time, effort, thought, and money. Many factors can influence whether students achieve, including parents' income, distance from the house to campus, number of siblings, completeness of learning tools, student socialization in the campus environment, student socialization in the home environment, GPA, and so on [11], [12].
Many computational methods can be used to predict whether a student will achieve at the end of the semester. Predictions are based on predetermined factors and on each student's values for those factors. Two methods are often used: the K-Means method [13] and the C4.5 method. The K-Means method groups student data into several clusters, while the C4.5 method makes the predictions. Combining the two methods produces a more precise and accurate output [14]. The goal of this research is to predict how many students will excel at the end of the semester using a combination of the K-Means and C4.5 methods. In addition, this study shows how the K-Means algorithm clusters [15] the data on students who will or will not excel, and how the C4.5 algorithm makes predictions for the students who have been grouped [16].
The object of this research is the students of Putra Indonesia University YPTK Padang. This campus was chosen because of its policy that outstanding students are awarded free tuition for the next semester and, if they excel again in that semester, are given the opportunity to take part in comparative studies in Malaysia and Singapore. Many previous studies have used the K-Means and C4.5 methods, including the work of Wiga Maulana Baihaqi on predicting SARA elements in tweets, which dealt with Twitter data that does and does not contain SARA elements. That work identified further research opportunities in improving the proposed method to obtain more accurate results in both grouping and classification, and recommended building a system that can be applied to analyze Twitter content.

A. Research Framework
In conducting this research, the authors followed a research framework prepared in advance. The framework is as follows:

B. Research Framework Details
Problem identification is the process of finding the main problems in the field that must be solved in this research. Literature study is the process of reviewing earlier theory and research related to solving the identified problems. Data collection is the process of gathering the field data that will be used in solving the problem. In this study, the data were sourced from the Academic Bureau (PDE and SISFO) of Putra Indonesia University YPTK Padang and from questionnaires filled out by students. The test data collected consist of 50 records, one per student. Each record has 8 attributes, as shown in Table 1.
The data collected in the field are processed using the K-Means method. The steps of the K-Means method [17], [18] are:
1. Determine how many groups (clusters) will be created; this number is called the value k.
2. Determine a random mean (centroid) value for each predetermined cluster.
3. Determine the distance from each data record to each cluster center (centroid) using the formula:
$$d(x_i, c_j) = \sqrt{\sum_{p=1}^{m} (x_{ip} - c_{jp})^2} \quad (1)$$
Information: $d(x_i, c_j)$ = the distance between record $x_i$ and centroid $c_j$, $i = 1, 2, 3, \ldots, n$, $j = 1, 2, 3, \ldots, k$, $m$ = the number of attributes.
4. Determine the closest cluster for each data record by comparing the distance values obtained in step 3, then update the cluster center (centroid) value using the formula:
$$c_j = \frac{1}{n_j} \sum_{x_i \in S_j} x_i \quad (2)$$
Information: $c_j$ = the cluster center value, $x_i$ = the value of each record in cluster $S_j$, $n_j$ = the number of records in cluster $S_j$.
5. Repeat steps 3 and 4 until no data record moves from one cluster to another.
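The K-Means steps listed above can be sketched in a few lines of Python. This is only an illustrative sketch, not the Rapid Miner implementation used in the study: it assumes Euclidean distance, and the function names `euclidean` and `k_means`, the `initial_centroids` parameter, and the six two-attribute records are all hypothetical.

```python
import math
import random


def euclidean(a, b):
    """Euclidean distance between two equal-length numeric records (assumed metric)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def k_means(records, k, initial_centroids=None, seed=0, max_iter=100):
    """Cluster `records` into `k` groups; returns (assignments, centroids)."""
    # Step 2: start from given centroids, or pick k random records as centroids.
    if initial_centroids is None:
        rng = random.Random(seed)
        initial_centroids = rng.sample(records, k)
    centroids = [list(c) for c in initial_centroids]

    assignments = None
    for _ in range(max_iter):
        # Step 3: assign every record to its nearest centroid.
        new_assignments = [
            min(range(k), key=lambda j: euclidean(record, centroids[j]))
            for record in records
        ]
        # Step 5: stop as soon as no record changes cluster.
        if new_assignments == assignments:
            break
        assignments = new_assignments
        # Step 4: move each centroid to the mean of its current members.
        for j in range(k):
            members = [r for r, a in zip(records, assignments) if a == j]
            if members:  # an empty cluster keeps its previous centroid
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return assignments, centroids


# Hypothetical records with two attributes already coded as 1..4.
students = [[1, 1], [1, 2], [2, 1], [4, 4], [4, 3], [3, 4]]
assignments, centroids = k_means(students, k=2, initial_centroids=[[1, 1], [4, 4]])
print(assignments)
print(centroids)
```

The result depends on the initial centroids: a poor random start can converge to a different, less natural grouping, which is why the sketch lets the caller fix them.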
The collected data are processed not only with the K-Means method but also with the C4.5 method. The steps of the C4.5 method [19], [20] are:
1. Select the data attribute that will be used as the root or prediction node of the decision tree and count the number of YES and NO values for each data record.
2. After obtaining the root of the decision tree, make a branch for each value by calculating the Gain value using the following formula:
$$\mathrm{Gain}(S, A) = \mathrm{Entropy}(S) - \sum_{i=1}^{n} \frac{|S_i|}{|S|}\,\mathrm{Entropy}(S_i)$$
Information: Gain(S, A) = the gain of attribute A on data set S, Entropy(S) = the total entropy value, Entropy(S_i) = the entropy value of partition S_i of attribute A, |S_i| and |S| = the number of records in S_i and S, n = the number of partitions (clusters) of attribute A.
The entropy value is calculated with the formula:
$$\mathrm{Entropy}(S) = \sum_{i=1}^{c} -p_i \log_2 p_i$$
where $p_i$ is the proportion of records in S that belong to class i (here YES or NO) and c is the number of classes.
The data processed with the K-Means and C4.5 methods are then analyzed, and conclusions are drawn from the results of the analysis. The K-Means results [21] show how many records fall in each cluster and which students belong to it; the clustered data are then analyzed with the C4.5 method [22] to determine which students are predicted to excel (winners) and which are predicted not to (not winners). Data processing uses Rapid Miner software version 9.7.002.
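The entropy and Gain calculation in step 2 of the C4.5 method can be illustrated with a short Python sketch. The function names `entropy` and `information_gain`, the attribute names, and the four sample rows are hypothetical and chosen only to make the arithmetic easy to follow; they are not taken from the study's data.

```python
import math
from collections import Counter


def entropy(labels):
    """Base-2 entropy of a list of class labels such as 'YES' / 'NO'."""
    total = len(labels)
    return -sum(
        (count / total) * math.log2(count / total)
        for count in Counter(labels).values()
    )


def information_gain(rows, attribute, target):
    """Gain(S, A): entropy of S minus the weighted entropy of each partition S_i."""
    labels = [row[target] for row in rows]
    # Partition the target labels by the value of the chosen attribute.
    partitions = {}
    for row in rows:
        partitions.setdefault(row[attribute], []).append(row[target])
    weighted = sum(
        len(part) / len(rows) * entropy(part) for part in partitions.values()
    )
    return entropy(labels) - weighted


# Hypothetical rows: 'gpa' separates the classes perfectly, 'siblings' not at all.
rows = [
    {"gpa": "high", "siblings": "few", "achieves": "YES"},
    {"gpa": "high", "siblings": "many", "achieves": "YES"},
    {"gpa": "low", "siblings": "few", "achieves": "NO"},
    {"gpa": "low", "siblings": "many", "achieves": "NO"},
]
for name in ("gpa", "siblings"):
    print(name, information_gain(rows, name, "achieves"))
```

The attribute with the largest gain is the one chosen as the node; in this toy data that is `gpa`, since splitting on it leaves each branch with a single class.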
The analyzed results and conclusions will then be applied in the field, where they can help an institution, especially an educational institution, predict whether its students will excel (win) or not (not win).

A. Results of Clustering Data Processing Using the K-Means Algorithm
The initial data that have been collected, consisting of 8 attributes and 50 data records, can be seen in Table 2 below.
Note:
• In the learning completeness attribute, 1 means that the student has incomplete learning tools, 2 means complete learning tools, and 3 means very complete learning tools.
• In the attributes for the student's socialization with the campus environment and with the environment in which they live, 0 means that the student is not actively socializing and 1 means that the student is actively socializing.
The data in Table 2 above were processed with the K-Means method using the Rapid Miner application (Mardalius, 2018) to obtain the data clustering. Clustering is carried out on all data attributes, namely parents' income, distance from house to campus, completeness of learning tools, socialization of students with the campus environment, socialization of students with their living environment, and cumulative grade point average (GPA). Before grouping, the data in Table 2 are converted into the values 1, 2, 3, and 4 using the raw data conversion in Table 3 below:
e. Step 5. Repeat steps 3 and 4 until no row of data moves from one group to another.
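As an illustration of the conversion of raw values into the codes 1 to 4 that precedes the clustering steps, the following Python sketch maps a raw value to a code by comparing it against ascending upper bounds. The function `to_code` and the GPA bounds are hypothetical; the actual conversion ranges are those of Table 3, which are not reproduced here.

```python
from bisect import bisect_left


def to_code(value, upper_bounds):
    """Return 1 for values up to the first bound, 2 up to the second, and so on.

    Values above the last bound get the code len(upper_bounds) + 1.
    """
    return bisect_left(upper_bounds, value) + 1


# Hypothetical bounds, NOT the ranges of Table 3:
# GPA <= 2.5 -> 1, <= 3.0 -> 2, <= 3.5 -> 3, above 3.5 -> 4.
GPA_BOUNDS = [2.5, 3.0, 3.5]

raw_gpas = [2.1, 2.9, 3.4, 3.8]
print([to_code(gpa, GPA_BOUNDS) for gpa in raw_gpas])
```

Keeping the bounds in a plain list per attribute means the same function can convert every numeric attribute, with only the bounds changing.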
Table 6 below shows the results of grouping the data using the K-Means algorithm. Producing the data seen in Table 3 and Table 4 above requires a data processing block design in the Rapid Miner application. The block design can be seen in Figure 2 below [23]:

B. Result of Prediction Data Processing Using The C4.5 Algorithm
The data collected by the researchers consist of 50 rows, that is, 50 students who did or did not win in the previous year. Data that have been grouped into clusters are given names that describe the type of cluster. The data are analyzed using the C4.5 method. Table 7 below shows the data after naming and the addition of attributes.
Step 1. Determine the data attribute that will be used as the root or prediction node of the decision tree and count the number of YES and NO values for each data row. Table 7 shows the data labeled YES and NO. After each data record has been named on each attribute, the prediction results are obtained with the C4.5 method using the Rapid Miner application. Table 8 below shows the prediction results. The prediction results can also be presented in graphical form (plot view), as shown in Fig. 4 below. The data processing block design in the Rapid Miner application that produces the data in Table 7, Table 8, and Figure 4 can be seen in Figure 5. The prediction results in Table 8 can also be summarized as statistics and presented as a text-view description of the decision tree. The shape of the decision tree from the prediction results can be seen in Figure 6 below:

C. Discussion
Based on the results of clustering with the K-Means method in the Rapid Miner application, according to the data in Table 3, the 50 student records were divided into several clusters based on their respective attributes. This was done so that the data could easily be processed into predictive data with the C4.5 method in the Rapid Miner application. Fig. 2 is a design view showing the block design of the clustering process for each attribute. Each data attribute is clustered using the K-Means method in the Rapid Miner application. From the far left, the design consists of the Read Excel block, then the clustering block using the K-Means method, then the performance block, which evaluates the result.
Based on the results of prediction with the C4.5 method in the Rapid Miner application, according to the data in Table 5, of the 50 students that made up the testing data, 17 were predicted to excel (champion, YES) and 33 were predicted not to (not champion, NO). Thus the predicted achievement (champion) was 34% and the predicted non-achievement (not champion) was 66% of the 50 students in the testing data. Fig. 5 shows the decision tree for the predictions, obtained from the graph view menu of the Rapid Miner application. In the decision tree, the root of the decision is the cumulative grade point average (GPA) attribute, followed by the distance from the house (boarding house) to the campus, then the parents' income, then the number of siblings, then the completeness of learning tools, then student socialization with the campus environment, and finally student socialization with the environment in which they live.
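As a check on the percentages quoted above, the arithmetic for the 50 testing records is:

$$\frac{17}{50} \times 100\% = 34\%, \qquad \frac{33}{50} \times 100\% = 66\%, \qquad 17 + 33 = 50$$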

IV. Conclusion and Suggestion
Based on the research that has been done, several conclusions can be drawn. Data clustering produces good results if the correct value of k (number of clusters) is selected; if the number of clusters is too large, the clustering results will not be good. When making predictions, the data should first be converted into polynominal data, or named according to the data group (cluster), so that the resulting decision tree is easy to read and analyze. The predictions will be more accurate if the training data entered into the C4.5 method have at least as many records (≥) as the testing data. Of the 100 students whose data were processed, 27 students (27%) were predicted to excel (winners) and 73 students (73%) were predicted not to (not winners).
Based on the research that has been done, we make several suggestions. We recommend using more attributes than were used here so that the clustering and prediction results are better. We recommend that the data attributes for clustering with the K-Means method be numerical, with the results then converted into polynominal data so that they can be used for prediction with the C4.5 method. We also recommend that the number of training data records be greater than the number of testing data records so that the predictions are more precise and accurate.

Fig. 2. Block design of the K-Means method in the Rapid Miner application
The results of the clustering can be viewed visually, showing what the grouping looks like, in Figure 3 below:

Fig. 4. Graph form (plot view) of the predicted data

Table 2. Initial Data

Table 3. Converting Raw Data Values
After obtaining the converted data as in Table 3 above, the next step is to cluster the data using the K-Means algorithm. The steps of the K-Means algorithm are as follows:
a. Step 1. Determine the number of data clusters to be created; this number is called the value k. The number of data groups determined in this study, derived from the data in Table 4, is 2 data clusters, namely cluster 1 and cluster 2.
b. Step 2. Determine the centroid value randomly for each predetermined group. The centroid value for each cluster is shown in Table 5 below:
c. Step 3. Determine the closest cluster center for each row of data from its distance to the centroid values, using the distance formula of step 3 of the K-Means method given in the Research Framework Details.

Table 6. Results of Data Clustering

Table 7. Naming Data According to Clusters