Estimation System For Late Payment Of School Tuition Fees

The Surakarta Al-Islam Vocational School is a private educational institution that requires all students to pay school tuition fees. Education is an obligation for all Indonesian citizens, and the cost of education is one of the most important input components in implementing it, because funding is a main requirement for achieving educational goals. SPP is a routine school fee that is paid every month. Based on last year's school administration report, around 60% of students were late in paying school tuition fees. This is a serious problem because the school's income comes from tuition. The purpose of this research is to build a prediction system using the best classification method, determined by comparing the accuracy of the Naïve Bayes method with that of the K-Nearest Neighbor (K-NN) method. Both methods can classify a student's payment as on time or late. Processing uses 236 records of Dapodik data for 2017/2018. To improve accuracy, the researchers also apply feature selection with Information Gain, which is useful for selecting the optimal parameters. System testing is carried out using the Confusion Matrix method. The final results of this study indicate that the Naïve Bayes method with Information Gain produces the highest accuracy, namely 95%, compared to the Naïve Bayes method alone, namely 85%, and the K-NN method, namely 81%.


I. INTRODUCTION
Attending basic education is an obligation for every Indonesian citizen. This is stated in Article 31 paragraph (1) of the 1945 Constitution and in Minister of Education and Culture Regulation No. 19 of 2016 on the Smart Indonesia Program [1]. Meanwhile, financial cost is one of the important components in the process of implementing education [2]. One of the financing sources comes from the Education Development Donation, simply called the tuition fee [3]. SMK Al-Islam Surakarta is one of the private educational institutions of the Al-Islam Surakarta Foundation and focuses on teaching Information Technology and Islamic Sciences. The school's operational costs are mostly charged to students, especially through school tuition that has to be paid monthly.
The problem that often arises at SMK Al-Islam Surakarta is that many students are late in paying school tuition fees. This is a serious problem because tuition fees are one of the main sources of funds for improving the quality of school education. Tuition fees are used to cover operational costs, including the salaries of teachers and employees. Based on data from the financial administration section, 60% of students were late in paying school tuition in the 2017/2018 school year. To overcome the problem of late payment, it is necessary to predict which students have the potential to pay late so that the school can take anticipatory action.
There are several predictive algorithms that can be used, including Naïve Bayes and K-Nearest Neighbor (K-NN). Both algorithms are included in the top 10 algorithms in data mining [4]. The Naïve Bayes algorithm has been used to predict the smoothness of credit card payments and the on-time graduation rate of students [5] [6]. K-NN has been used in many classification tasks, including the classification of skin conditions, emotion recognition from multichannel EEG signals, and gene expression cancer classification [7]. Several studies have also compared Naïve Bayes and K-NN with other algorithms [8].
This study compares the accuracy of the Naïve Bayes algorithm and the K-NN algorithm in predicting the timeliness of tuition fee payments [9]. After each algorithm is run using all available parameters, the experiment also examines how each algorithm performs when it is run with parameters selected by a feature selection technique, namely Information Gain [10]. Research that has used Information Gain to select features includes work on classifying documents for sentiment analysis [11]. The best algorithm from this study will be applied in a system for predicting late payment of school tuition [12].

II. RESEARCH METHOD
The data used in this study are sourced from the school's official Basic Education Data (Dapodik) for the 2017/2018 school year and from school tuition fee payment transaction reports. The attributes used as determinant variables are Parent Income, Family Dependent, Father's Education, Father's Age, Mother's Education and Mother's Age [13].
A total of 236 data records were taken. The records are then split into training data and testing data. The training data are used to create the prediction model with the observed methods, Naïve Bayes and K-NN, while the testing data are used to determine the accuracy of the model. The proportion of training data to testing data is 75% to 25%, following the reference proportion for data usage. In other words, of the 236 records, 177 are training data and 59 are testing data [14].
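The 75%/25% split described above can be sketched as follows. This is a minimal illustration, assuming the records are shuffled with a fixed seed before splitting; the paper does not state how records were assigned to each set, and `split_records` and the placeholder record IDs are hypothetical names used only for this example.

```python
import random


def split_records(records, train_fraction=0.75, seed=0):
    """Shuffle a copy of the records and split it into (training, testing)."""
    shuffled = list(records)
    # Assumption: a seeded shuffle; the paper does not describe the ordering.
    random.Random(seed).shuffle(shuffled)
    n_train = round(len(shuffled) * train_fraction)
    return shuffled[:n_train], shuffled[n_train:]


# Placeholder IDs standing in for the 236 student records.
records = list(range(236))
training, testing = split_records(records)
print(len(training), len(testing))  # 177 59
```

Any split that preserves the 177/59 proportion matches the counts given in the text; the seed only makes the example repeatable.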
The study also combines the Naïve Bayes and K-NN methods with Information Gain and compares all of them to find the best result. The four combinations are: Naïve Bayes with all variables, Naïve Bayes with variables selected by Information Gain, K-NN with all variables, and K-NN with variables selected by Information Gain.
The models created by each method are then tested with 4-fold cross validation; that is, testing for each method is carried out 4 times with a different combination of training data and testing data, as shown in Table 1. The prediction performance on the testing data in each fold is calculated using a confusion matrix, which produces the values of accuracy, precision, recall and F-Measure. Each of these values is averaged over all folds. The average accuracy, precision, recall and F-Measure of each method are compared to determine the best method for predicting overdue payment of school tuition.
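The fold construction can be illustrated with a short sketch. It assumes the 236 records are cut into four contiguous blocks of 59, each used once as testing data; the actual assignment in Table 1 is not reproduced here, and `k_fold_indices` is a hypothetical helper name.

```python
def k_fold_indices(n_records, n_folds=4):
    """Yield (train_idx, test_idx) pairs; each block is the test set once."""
    fold_size = n_records // n_folds
    indices = list(range(n_records))
    for fold in range(n_folds):
        start = fold * fold_size
        # The last fold absorbs any remainder when the split is uneven.
        stop = n_records if fold == n_folds - 1 else start + fold_size
        test_idx = indices[start:stop]
        train_idx = indices[:start] + indices[stop:]
        yield train_idx, test_idx


folds = list(k_fold_indices(236, 4))
for number, (train_idx, test_idx) in enumerate(folds, start=1):
    print(f"Fold {number}: {len(train_idx)} training, {len(test_idx)} testing")
```

With 236 records and 4 folds, every fold has 177 training and 59 testing records, which agrees with the 75%/25% proportion used in the study.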

Naïve Bayes Method
The Naïve Bayes algorithm is a popular classification method based on probability and statistics. The basic form of Bayes' theorem is shown in Equation 1.

$$P(A \mid B) = \frac{P(B \mid A)\,P(A)}{P(B)} \quad (1)$$

The probability of A given B is obtained from the probability of B given A, multiplied by the probability of A and divided by the probability of B. Applying Naïve Bayes to data with more than one feature or attribute makes Equation 1 more complex, as shown in Equation 2.

$$P(A \mid B_1, \ldots, B_n) = \frac{P(A)\,P(B_1, \ldots, B_n \mid A)}{P(B_1, \ldots, B_n)} \quad (2)$$

The value of $P(B_1, \ldots, B_n)$ is constant for each experiment, so the class with the maximum probability is determined by the maximum value of $P(A)\,P(B_1, \ldots, B_n \mid A)$ [15]. The resulting function is therefore the maximum of the product of the prior and the likelihood, as shown in Equation 3.

$$\hat{A} = \arg\max_{A} \; P(A)\,P(B_1, \ldots, B_n \mid A) \quad (3)$$
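A minimal sketch of this decision rule for categorical attributes is given below. It assumes the usual naive independence assumption, so the likelihood is the product of per-attribute conditional probabilities, and it uses plain relative frequencies without smoothing. The function names and the six toy records are hypothetical and are not taken from the school dataset.

```python
from collections import Counter, defaultdict


def train_naive_bayes(rows, labels):
    """Count class frequencies and per-class attribute-value frequencies."""
    class_counts = Counter(labels)
    value_counts = defaultdict(Counter)  # (attribute index, class) -> counts
    for row, label in zip(rows, labels):
        for index, value in enumerate(row):
            value_counts[(index, label)][value] += 1
    return class_counts, value_counts


def predict_naive_bayes(model, row):
    """Return (best class, scores) using prior * product of conditionals."""
    class_counts, value_counts = model
    total = sum(class_counts.values())
    scores = {}
    for label, count in class_counts.items():
        score = count / total  # prior probability of the class
        for index, value in enumerate(row):
            # Relative frequency of this attribute value within the class.
            score *= value_counts[(index, label)][value] / count
        scores[label] = score
    return max(scores, key=scores.get), scores


# Toy records: (income level, number of dependents); hypothetical values.
rows = [("low", "many"), ("low", "few"), ("high", "few"),
        ("high", "few"), ("low", "many"), ("high", "many")]
labels = ["Overdue", "Overdue", "Ontime", "Ontime", "Overdue", "Ontime"]

model = train_naive_bayes(rows, labels)
prediction, scores = predict_naive_bayes(model, ("low", "many"))
print(prediction, scores)
```

Because no smoothing is applied, an attribute value never seen with a class drives that class's score to zero, as happens for "Ontime" in this toy example.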

K-NN (Nearest Neighbor) Method
The K-Nearest Neighbor (K-NN) algorithm is a well-known algorithm in machine learning because its process is easy and simple. In the K-NN algorithm, all available data must have a label so that the closest data can be identified in the comparison process. The working principle of K-NN is to find the K nearest neighbors in the training data for the data being evaluated [16]. The steps of the K-NN algorithm are:
a. Determine the parameter K.
b. Calculate the distance between the testing data and the training data. If the data are numeric, the Euclidean distance is used, as shown in Equation 4.

$$D(x, y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2} \quad (4)$$

With: $x_i$ = training data, $y_i$ = testing data, $D(x, y)$ = distance, $i$ = variable index, $n$ = data dimension.
c. Sort all the distances in ascending order.
d. Select the K closest distances.
e. Select the class that occurs most often among them, then classify the testing data as that class.
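The steps above can be sketched in a few lines. This is an illustrative implementation under the assumption that all attributes are already numeric; `knn_predict` and the toy points are hypothetical and do not come from the school dataset.

```python
import math
from collections import Counter


def euclidean_distance(x, y):
    """Euclidean distance between two equal-length numeric vectors."""
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))


def knn_predict(training_rows, training_labels, query, k):
    """Classify query by majority vote among its k nearest training rows."""
    # Steps b and c: compute every distance, then sort in ascending order.
    distances = sorted(
        (euclidean_distance(row, query), label)
        for row, label in zip(training_rows, training_labels)
    )
    # Step d: keep the k closest neighbors.
    nearest_labels = [label for _, label in distances[:k]]
    # Step e: the most frequent class among them is the prediction.
    return Counter(nearest_labels).most_common(1)[0][0]


training_rows = [(1, 1), (1, 2), (2, 1), (6, 6), (7, 6), (6, 7)]
training_labels = ["Ontime"] * 3 + ["Overdue"] * 3

print(knn_predict(training_rows, training_labels, (2, 2), k=3))  # Ontime
print(knn_predict(training_rows, training_labels, (6, 5), k=3))  # Overdue
```

Using an odd K with two classes, as in this sketch, avoids tied votes.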

Variables Selection with Information Gain
Information Gain is one of the simplest feature selection methods. It ranks attributes and is widely used in text categorization, microarray data analysis and image data analysis [17]. Information Gain can help reduce noise caused by irrelevant features by detecting the features that carry the most information about a particular class [18]. Determining the best attribute starts with calculating the entropy value. Entropy is a measure of class uncertainty based on the probability of particular events or attributes [19]. The formula for calculating entropy is shown in Equation 5. After the entropy value is obtained, the Information Gain can be calculated using Equation 6.

$$Entropy(S) = -\sum_{i=1}^{c} p_i \log_2 p_i \quad (5)$$

$$Gain(S, A) = Entropy(S) - \sum_{v \in Values(A)} \frac{|S_v|}{|S|}\,Entropy(S_v) \quad (6)$$

Here $c$ is the number of values in the classification class and $p_i$ is the proportion of samples in $S$ that belong to class $i$. $S_v$ is the subset of $S$ for which attribute $A$ has value $v$.
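The entropy and gain calculations can be sketched as below, using the standard definitions (entropy over class proportions, gain as total entropy minus the weighted entropy of each attribute-value subset). The attribute names and the four toy records are hypothetical and serve only to show how attributes are ranked.

```python
import math
from collections import Counter


def entropy(labels):
    """Entropy (base 2) of a list of class labels."""
    total = len(labels)
    return -sum(
        (count / total) * math.log2(count / total)
        for count in Counter(labels).values()
    )


def information_gain(values, labels):
    """Entropy of labels minus the weighted entropy after splitting by values."""
    total = len(labels)
    subsets = {}
    for value, label in zip(values, labels):
        subsets.setdefault(value, []).append(label)
    remainder = sum(len(s) / total * entropy(s) for s in subsets.values())
    return entropy(labels) - remainder


labels = ["Ontime", "Ontime", "Overdue", "Overdue"]
attributes = {
    "income": ["high", "high", "low", "low"],  # separates the classes fully
    "noise": ["a", "b", "a", "b"],             # carries no class information
}

gains = {name: information_gain(vals, labels) for name, vals in attributes.items()}
ranking = sorted(gains, key=gains.get, reverse=True)
print(gains, ranking)
```

An attribute that separates the classes perfectly reaches the full entropy of the labels as its gain, while an uninformative attribute has a gain of zero and would be ranked last.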

Confusion Matrix Testing
Accuracy calculations in data mining can be done by entering a set of testing data into the data mining model and comparing the classification values produced by the model, as predictions, with the actual values in the testing data. A simple classification usually consists of two classes, which indicate whether the main observation target or event occurred (Positive) or did not occur (Negative) [20]. The confusion matrix is a method used to calculate accuracy in this way. Accuracy measurement is done by confusion matrix testing, as shown in Table 2, with:
TP is True Positive, that is, the real data is positive and is correctly classified as positive by the system.
TN is True Negative, that is, the real data is negative and is correctly classified as negative by the system.
FN is False Negative, that is, the real data is positive but is wrongly classified as negative by the system.
FP is False Positive, that is, the real data is negative but is wrongly classified as positive by the system.

International Journal of Computer and Information System (IJCIS), Peer Reviewed International Journal, Vol. 01, Issue 01, May 2020, e-ISSN: 2745-9659, https://ijcis.net/index.php/ijcis/index
From the testing data applied to the model, the confusion matrix produces several values, namely Accuracy, Precision, Recall and F-Measure. Accuracy is the proportion of cases in which the prediction and the real value are both positive (TP) or both negative (TN), compared to the total number of cases, that is $(TP + TN) / (TP + TN + FP + FN)$ [15]. Precision, or confidence, is the ratio of cases in which the prediction and the real value are both positive (TP) to all cases predicted as positive, that is $TP / (TP + FP)$. Recall, or sensitivity, is the ratio of cases in which the prediction and the real value are both positive (TP) to all cases whose real value is positive, that is $TP / (TP + FN)$ [20]. F-Measure is the harmonic mean of Precision and Recall, that is $2 \cdot Precision \cdot Recall / (Precision + Recall)$.
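These four measures follow directly from the confusion matrix counts, as the sketch below shows. The counts used in the example are hypothetical (chosen to total 59, the size of one testing fold) and are not results from the paper; `confusion_metrics` is an illustrative name.

```python
def confusion_metrics(tp, tn, fp, fn):
    """Compute accuracy, precision, recall and F-Measure from raw counts.

    Assumes at least one positive prediction and one positive real case,
    so that no denominator is zero.
    """
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f_measure = 2 * precision * recall / (precision + recall)
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f_measure": f_measure,
    }


# Hypothetical counts for one fold of 59 testing records.
metrics = confusion_metrics(tp=20, tn=25, fp=5, fn=9)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

In a cross-validation setting, these values would be computed per fold and then averaged, as described in the research method.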

III. RESULT AND ANALYSIS
The research process starts from retrieving data from the school database and transforming the data into a dataset as in Table 3.

Information Gain Variable Selection
Information Gain is used to select the optimal attributes for making predictions. Calculations are based on the dataset in Table 3. The steps in calculating Information Gain are as follows:
1. Determine the total entropy, calculated as in Equation 5. From Table 3, the total number of records is 236, with two classes: "Ontime" with 104 records and "Overdue" with 132 records. The total entropy is computed as:

$$Total\ Entropy = -\frac{104}{236}\log_2\frac{104}{236} - \frac{132}{236}\log_2\frac{132}{236} \approx 0.9898$$

2. Based on Table 4, the gain value of each attribute in the dataset is calculated. The sorted results are collected in Table 5.
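The total entropy for the class counts stated above (104 "Ontime" and 132 "Overdue" out of 236) can be checked with a few lines of code. Only the class counts come from the text; the variable names are illustrative.

```python
import math

# Class counts reported for the full dataset of 236 records.
counts = {"Ontime": 104, "Overdue": 132}
total = sum(counts.values())

# Entropy in bits: minus the sum of p * log2(p) over both classes.
total_entropy = -sum(
    (n / total) * math.log2(n / total) for n in counts.values()
)
print(round(total_entropy, 4))  # 0.9898
```

A value close to 1 reflects that the two classes are nearly balanced, so the class label alone carries almost one full bit of uncertainty.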

Naive Bayes Calculation
The 4-fold cross validation is carried out on the dataset in Table 3. Below is the calculation for Fold 4 using all determinant variables:
1. Determine the number of records in each class from the 177 training data:
K1 (Class "Ontime") = 73
K2 (Class "Overdue") = 104
2. Determine the probability of each attribute value from the 177 training data. The "Ontime" probability is calculated as the number of records in the "Ontime" class with a given attribute value divided by the number of records in the "Ontime" class (K1). The "Overdue" probability is calculated as the number of records in the "Overdue" class with a given attribute value divided by the number of records in the "Overdue" class (K2). The complete calculation results are shown in Table 6.
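As a small worked example, the class priors for Fold 4 follow from K1 and K2, and each attribute-value probability is a relative frequency within its class. The sketch below uses the class counts from the text; the attribute-value count of 20 is hypothetical and is not taken from Table 6.

```python
# K1 and K2 for Fold 4, as stated in the text.
class_counts = {"Ontime": 73, "Overdue": 104}
n_training = sum(class_counts.values())

# Prior probability of each class among the 177 training records.
priors = {label: count / n_training for label, count in class_counts.items()}


def conditional_probability(count_with_value, class_count):
    """P(attribute value | class) as a relative frequency within the class."""
    return count_with_value / class_count


for label, prior in priors.items():
    print(label, round(prior, 4))

# Hypothetical count: 20 "Ontime" records sharing some attribute value.
example = conditional_probability(20, class_counts["Ontime"])
print(round(example, 4))
```

The priors and conditional probabilities computed this way are the factors multiplied together when a testing record is scored for each class.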

Examining the Testing Data
The testing data are the records from the dataset in Table 3 that are not used as training data, that is, 59 records, as shown in Table 7. Each record in Table 7 is classified as an overdue or on-time payment by using Equation 3. The results are shown in Table 8.

Test Result
Based on the prediction results for the 59 testing records in Table 8, the Confusion Matrix method yields the results shown in Table 9.

Implementation of the Naive Bayes Method with Information Gain
The calculation is done in the same way as the calculation using the Naïve Bayes method, but uses only the 4 best parameters. The results of system testing on the Naïve Bayes method using the variables selected by the Information Gain calculation are shown in Table 10.

K-NN Calculation
Predictions using the K-NN method are made by calculating the proximity between training data and testing data with the Euclidean distance equation [21]. To calculate this distance, the dataset in Table 3 needs to be converted into numbers. The first step is to choose the K value that produces the best accuracy. The researchers try to predict late payment with K values of 1, 3, 5, 7, 9, 11, 13 and 15. After the value of K is determined, the converted data are divided into training data and testing data. For the Fold 4 test, the training data are shown in Table 11 and the testing data in Table 12.
Table 11. K-NN Training Data
Table 12. K-NN Testing Data
The prediction for each record of the testing data in Table 12 is calculated by comparing it with all training data. The distance calculation is shown in Equation 11.
The confusion matrix calculation using 4-Fold Cross Validation is shown in Table 13.
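The search over K values can be sketched as a simple loop that scores each candidate on the testing data and keeps the best one. The candidate list comes from the text; the toy one-dimensional points, the tie-breaking rule (smallest K wins) and the helper names are assumptions made only for this illustration.

```python
import math
from collections import Counter

K_VALUES = [1, 3, 5, 7, 9, 11, 13, 15]  # candidate values listed in the text


def knn_predict(train, query, k):
    """train is a list of (features, label); vote among the k nearest rows."""
    nearest = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]


def accuracy_for_k(train, test, k):
    """Fraction of test rows whose predicted label equals the real label."""
    correct = sum(knn_predict(train, feats, k) == label for feats, label in test)
    return correct / len(test)


def best_k(train, test, k_values=K_VALUES):
    """Return (best K, accuracy per K); ties go to the smaller K."""
    scores = {k: accuracy_for_k(train, test, k) for k in k_values}
    return max(scores, key=scores.get), scores


# Hypothetical toy data: two well-separated groups on a line.
train = [((float(i), 0.0), "Ontime") for i in range(10)]
train += [((float(i), 0.0), "Overdue") for i in range(10, 20)]
test = [((2.5, 0.0), "Ontime"), ((16.5, 0.0), "Overdue"), ((9.4, 0.0), "Ontime")]

chosen_k, scores = best_k(train, test)
print(chosen_k, scores)
```

In the study itself the comparison would be made on the averaged 4-fold results rather than on a single testing set as in this sketch.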

Calculation of K-NN + Information Gain
With Information Gain feature selection added to the K-NN method, the confusion matrix calculations using 4-Fold Cross Validation are shown in Table 14.

Method Comparison
The results of each method's observations are put together in Table 15. The table shows that the Naïve Bayes + Information Gain algorithm has the best performance, with an accuracy value of 67%.

IV. CONCLUSION
This research has produced a data mining model that is able to classify payments into the Ontime or Overdue class based on the attributes parent income, family dependents, father's education, father's age, mother's education and mother's age. To select the best method, the researchers compared four methods, namely Naïve Bayes, Naïve Bayes + Information Gain, K-NN, and K-NN + Information Gain. The best method is the combination of the Naïve Bayes algorithm with Information Gain feature selection, which produces accuracy = 67%, precision = 64%, recall = 59% and F-Measure = 61%.
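The reported F-Measure can be cross-checked against the reported precision and recall using the harmonic-mean definition. The sketch below uses only the rounded percentages stated in the conclusion, so the check is approximate.

```python
# Rounded values reported for Naïve Bayes + Information Gain.
precision = 0.64
recall = 0.59

# F-Measure as the harmonic mean of precision and recall.
f_measure = 2 * precision * recall / (precision + recall)
print(round(f_measure, 2))  # 0.61
```

The result agrees with the stated F-Measure of 61%.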
