Performance Testing of KNN and Logistic Regression Algorithms in Classifying Heart Disease Susceptibility

Abstract: Cardiovascular diseases, the category of heart and blood vessel disorders, cause 17.9 million deaths worldwide each year. This calls for greater attention to anticipating the risk of heart attacks, which can affect anyone at any time. Data analysis, or data mining, has become a significant contribution of information technology in providing valuable information about the risk of heart disease. Data analysis using the K-Nearest Neighbor and Logistic Regression algorithms is expected to provide information on the susceptibility categories for heart disease, such as age, gender, cholesterol levels, and so on. This information is intended to serve as a reference for individuals to be more vigilant in maintaining their health. The results indicate that the highest correlation related to heart disease susceptibility is between a person's age and their body weight. The correlation coefficient between these two variables is 0.37, suggesting a relationship between age and body weight that can make a person more susceptible to heart disease. Testing with both algorithms shows a high level of accuracy: K-Nearest Neighbor achieves an accuracy of 0.95, while Logistic Regression achieves 0.96.


I. INTRODUCTION
In 2021, the World Health Organization (WHO) released data indicating that 1 out of every deaths worldwide is caused by heart disease. In Indonesia, the leading causes of heart disease are improper diet, obesity, lack of physical activity, and excessive tobacco consumption [1]. Lack of information about the factors that can lead to heart disease is one of the main reasons for delays in preventing it [2], [3].
Data analysis, or data mining, has become one of the contributions of information technology in providing valuable information about the risk of heart disease [4]. Such approaches are expected to provide information on the susceptibility categories for heart disease, such as age, gender, cholesterol levels, and so on. This information is intended to serve as a reference for individuals to be more vigilant in maintaining their health. Data processing, or data mining, is an area of expertise that currently receives attention in many fields, because proper data processing can yield valuable information in various sectors of society. Each algorithm and method yields a different level of accuracy, which is a consideration in determining which algorithm or method is more reliable. Performance testing, which compares accuracy levels on the same objects, is one way to determine how accurately an algorithm performs on predetermined objects.
Based on the description above, the author believes it is necessary to conduct research to identify potential factors that may lead to heart disease, given the high mortality rate caused by this condition. Additionally, the use of classification methods to determine the factors causing heart disease should consider accuracy levels in order to provide a high level of confidence.

Theoretical Foundation
In this research, data mining and classification are used as problem-solving methods. Data mining is the process of discovering relationships or patterns across hundreds or thousands of fields in a large relational database. It is also often described as a series of processes for extracting added value in the form of previously unknown information. Data mining is primarily used to search for knowledge within large databases and is often referred to as Knowledge Discovery in Databases [5], [6].
Classification is one of the primary tasks in data mining. It groups or categorizes data into specific classes based on the attributes of the data. The main goal of classification is to build a model that can predict the class of unlabeled data based on patterns found in labeled data [7].

Pearson Correlation
The Pearson correlation test is a statistical method used to measure the extent of the linear relationship between two continuous variables. This method produces the Pearson correlation coefficient (r), which measures the strength and direction of the relationship between the two variables.
The Pearson correlation coefficient (r) can be calculated using the following mathematical formula [8], [9]:

$$r = \frac{\sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})}{\sqrt{\sum_{i=1}^{n} (X_i - \bar{X})^2 \, \sum_{i=1}^{n} (Y_i - \bar{Y})^2}}$$

Explanation of the formula:
$X_i, Y_i$ : The values of both variables
$\bar{X}$ : The average of variable X
$\bar{Y}$ : The average of variable Y
$n$ : The number of paired observations

K-Nearest Neighbor
K-Nearest Neighbors (KNN) is a machine learning algorithm used for classification and regression tasks. The algorithm finds a certain number of nearest neighbors (called "K") of an unlabeled data point and then performs classification or regression based on the majority or the average of the labels of those neighbors [9].
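The neighbor search and majority vote described above can be sketched in a few lines of plain Python. This is a minimal illustration and not the authors' implementation: the function name `knn_predict`, the use of Euclidean distance, the value k = 3, and the toy (age, BMI) points are all assumptions made for the example.

```python
import math
from collections import Counter


def knn_predict(train_points, train_labels, query, k=3):
    """Classify `query` by majority vote among its k nearest training points."""
    # Euclidean distance from the query to every labeled point (assumed metric).
    distances = [
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    ]
    # Keep the k closest neighbors.
    nearest = sorted(distances, key=lambda pair: pair[0])[:k]
    # Majority vote over the neighbors' labels.
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Hypothetical toy data: (age, BMI) pairs, 0 = not at risk, 1 = at risk.
points = [(25, 21.0), (30, 23.5), (35, 22.0), (58, 31.0), (62, 29.5), (67, 33.0)]
labels = [0, 0, 0, 1, 1, 1]

print(knn_predict(points, labels, (60, 30.0), k=3))
print(knn_predict(points, labels, (28, 22.5), k=3))
```

Because the vote depends on raw distances, features on very different scales (for example age in years and a 0/1 flag) are usually rescaled before KNN is applied; the sketch omits that step.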

Logistic Regression
Logistic regression is a technique from statistics and machine learning that applies regression analysis to classification problems. Despite having the word "regression" in its name, logistic regression is actually used to classify data into two or more categories or classes based on a set of attributes or features. It is a very commonly used binary classification algorithm. Logistic regression models the probability that a sample of data belongs to one of two possible classes (usually referred to as the positive class and the negative class). Mathematically, the model expresses the probability of the positive class (y = 1) through the log-odds, written as a function of the independent variables (features or attributes) and the model parameters [10].
• Logit Function (Log-Odds): The logistic regression model calculates the log-odds (logit) of the probability of the positive class as a linear combination of the features $x_1, \dots, x_n$ with coefficients $\beta_0, \beta_1, \dots, \beta_n$:

$$\operatorname{logit}(P(y=1)) = \beta_0 + \beta_1 x_1 + \dots + \beta_n x_n$$

The log-odds values are transformed into probabilities using the sigmoid function (logistic function), which produces probability values between 0 and 1:

$$P(y=1) = \frac{1}{1 + e^{-\operatorname{logit}(P(y=1))}}$$

Explanation:
P(y=1) : the probability that a sample of data belongs to the positive class.
e : Euler's number (mathematical constant).
logit(P(y=1)) : the log-odds value calculated in the previous step.
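The two steps above, computing the log-odds and passing it through the sigmoid function, can be sketched as follows. This is an illustrative example only: the coefficient values, the two features (age and BMI), and the 0.5 decision threshold are assumptions chosen for the demonstration, not values estimated in this research.

```python
import math


def log_odds(features, weights, bias):
    """Linear combination: bias + sum of weight_i * x_i (the logit)."""
    return bias + sum(w * x for w, x in zip(weights, features))


def sigmoid(z):
    """Map a log-odds value z to a probability in (0, 1)."""
    # Two algebraically equal branches avoid overflow for large |z|.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    exp_z = math.exp(z)
    return exp_z / (1.0 + exp_z)


def predict(features, weights, bias, threshold=0.5):
    """Return (predicted class, probability of the positive class)."""
    probability = sigmoid(log_odds(features, weights, bias))
    return (1 if probability >= threshold else 0), probability


# Hypothetical coefficients for (age, BMI); not fitted to any real data.
weights = [0.05, 0.10]
bias = -5.0

print(predict((60, 30.0), weights, bias))
print(predict((30, 22.0), weights, bias))
```

In practice the coefficients are estimated during training rather than set by hand; the sketch only shows how a fitted model turns feature values into a probability and a class label.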

II. RESEARCH METHODS
This research begins with the collection of the dataset to be used. Once the dataset is obtained, the data is processed using the Python programming language. The first step is the Pearson correlation test, which determines the level of relationship between the variables associated with heart disease. After the Pearson correlation test is completed, the K-Nearest Neighbor and Logistic Regression algorithms are tested to determine the accuracy each can achieve.
The classification process for both algorithms starts with training on the dataset, followed by testing. From these two processes, the accuracy values of both algorithms can be obtained and compared. This is further illustrated in Figure 1.
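The training and testing flow can be illustrated with a simple holdout split and an accuracy function. The sketch below is an assumption about how such a split might be written, not the authors' code: the helper names `holdout_split` and `accuracy`, the fixed random seed, and the toy data are hypothetical, and the 25% test fraction follows the testing scenario reported in the results section.

```python
import random


def holdout_split(rows, labels, test_fraction=0.25, seed=0):
    """Shuffle indices and hold out `test_fraction` of the samples for testing."""
    indices = list(range(len(rows)))
    random.Random(seed).shuffle(indices)
    n_test = round(len(indices) * test_fraction)
    test_idx, train_idx = indices[:n_test], indices[n_test:]

    def take(seq, idx):
        return [seq[i] for i in idx]

    return (take(rows, train_idx), take(rows, test_idx),
            take(labels, train_idx), take(labels, test_idx))


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


# Hypothetical toy data: 8 samples, so 25% corresponds to 2 test samples.
rows = [[i, i * 2] for i in range(8)]
labels = [0, 0, 0, 0, 1, 1, 1, 1]
x_train, x_test, y_train, y_test = holdout_split(rows, labels)
print(len(x_train), len(x_test))

# Two hypothetical sets of predictions scored against the same true labels.
print(accuracy([1, 0, 1, 1], [1, 0, 0, 1]))
print(accuracy([1, 0, 1, 1], [1, 0, 1, 1]))
```

Scoring both algorithms on the same held-out samples is what makes their accuracy values directly comparable.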

III. RESULT AND ANALYSIS
In this research, the dataset used is shown in Figure 2; it contains attributes such as age, sex, smoking status, etc.

Pearson Correlation Test
The Pearson correlation test starts with importing the required libraries, followed by preparing the dataset in the form of NumPy arrays. Next, the Pearson correlation and p-value are calculated, and the correlation coefficient is analyzed from these calculations. The results of the Pearson correlation test in this research show a Pearson coefficient of 0.027 and a p-value of 0.05. The relationship between variables is illustrated in Figure 4. From Figure 4, it can be observed that the highest correlation occurs between age and BMI (Body Mass Index), at 0.37, indicating a relationship that can lead to the occurrence of heart disease. The second-highest correlation is between age and hypertension, at 0.28, showing a connection between the two variables that carries a risk of heart disease. Further details are shown in Table 1. The results from Table 1 indicate that the age variable has a significant influence on the risk of heart disease. The correlation test results show a correlation between several variables indicating a risk of heart disease. However, the values obtained show that the correlation is at a low level. This may occur for several reasons, one of which is the dataset used.
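To make the calculation of the coefficient concrete, the sketch below computes r directly from the Pearson formula and builds a small pairwise table of the kind a correlation heatmap visualizes. It is a plain-Python illustration under stated assumptions: the column values are invented for the example and are not taken from the dataset used in this research, and the p-value step is omitted. In a library-based workflow, `scipy.stats.pearsonr` returns both the coefficient and the p-value.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient r between two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    # Numerator: sum of products of deviations from the means.
    cross = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    # Denominator: square root of the product of the summed squared deviations.
    # (A constant column gives a zero denominator; not handled in this sketch.)
    ss_x = sum((a - mean_x) ** 2 for a in x)
    ss_y = sum((b - mean_y) ** 2 for b in y)
    return cross / math.sqrt(ss_x * ss_y)


def correlation_table(columns):
    """Pairwise r for every pair of named columns (dict of name -> list)."""
    names = list(columns)
    return {
        (a, b): pearson_r(columns[a], columns[b])
        for i, a in enumerate(names)
        for b in names[i + 1:]
    }


# Hypothetical values, invented for illustration only (not the paper's dataset).
data = {
    "age": [29, 35, 41, 48, 53, 60, 66],
    "bmi": [22.1, 27.4, 23.0, 26.8, 24.5, 29.9, 25.2],
    "hypertension": [0, 0, 0, 1, 0, 1, 1],
}

for pair, r in correlation_table(data).items():
    print(pair, round(r, 2))
```

A coefficient near 0 indicates a weak linear relationship and a coefficient near 1 or -1 a strong one, which is the scale against which values such as 0.37 and 0.28 are read.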

K-Nearest Neighbor Test
The K-Nearest Neighbor classification is carried out by holding out 25% of the data for testing. After testing, the results obtained are shown as follows.
Figure 5. Classification Report KNN
Figure 5 reports the data processing results using K-Nearest Neighbor, which are summarized in Table 2.

Logistic Regression Test
The Logistic Regression method is tested under the same scenario, with 25% of the data used for testing. The results of this scenario are as follows:
Figure 6. Classification Report Logistic Regression
Figure 6 reports the data processing results using Logistic Regression, which are summarized in Table 3.
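A classification report of the kind shown in Figures 5 and 6 typically lists precision, recall, F1-score, and support per class. The sketch below shows how those quantities are derived from true and predicted labels; the function name `classification_report` and the label lists are hypothetical and do not reproduce the values in the figures.

```python
def classification_report(y_true, y_pred):
    """Per-class precision, recall, F1-score, and support."""
    report = {}
    for cls in sorted(set(y_true) | set(y_pred)):
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == cls and p == cls)  # true positives
        fp = sum(1 for t, p in pairs if t != cls and p == cls)  # false positives
        fn = sum(1 for t, p in pairs if t == cls and p != cls)  # false negatives
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        report[cls] = {
            "precision": precision,
            "recall": recall,
            "f1": f1,
            "support": tp + fn,  # number of true samples of this class
        }
    return report


# Hypothetical labels: 0 = no heart disease, 1 = heart disease.
y_true = [0, 0, 0, 0, 1, 1, 0, 0]
y_pred = [0, 0, 0, 0, 1, 0, 0, 1]

for cls, metrics in classification_report(y_true, y_pred).items():
    print(cls, metrics)
```

Reading the per-class rows alongside overall accuracy matters when one class is much rarer than the other, because a model can reach high accuracy while still missing many samples of the rare class.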

Interpretation of the Results
Based on the tests conducted in this research, the variables most strongly associated with susceptibility to heart disease are age combined with an unhealthy body weight. This indicates that, according to this research, individuals of a certain age who do not maintain an ideal body weight are more susceptible to heart disease.
In the scenario conducted, in which 25% of the dataset was used for testing, K-Nearest Neighbor obtained an accuracy of 95%, while Logistic Regression, under the same scenario, achieved an accuracy of 96%. Therefore, it can be concluded from the tests that the Logistic Regression algorithm has a higher level of accuracy than the K-Nearest Neighbor algorithm.

IV. CONCLUSION
Based on the research conducted, it can be concluded that the age and BMI variables have the highest correlation that can affect a person's susceptibility to heart disease. Furthermore, the Logistic Regression method exhibits a higher level of accuracy than the KNN method. For a more comprehensive presentation, the author provides the following:
1. The best method based on the conducted tests is Logistic Regression, with an accuracy of 96%. Meanwhile, the accuracy of the K-Nearest Neighbor method was 95%.
2. The highest risk factor for heart disease in this research is the correlation between age and BMI, with a correlation coefficient of 0.37. Based on this, there is a relationship between a person's age and their body weight that can lead to the occurrence of heart disease.
3. The correlation values obtained from the tests tend to be low with the dataset used. This suggests that future research should consider using different datasets or expanding the percentage of data used.

Figure 1. Research Flow
[Flowchart: Collecting Data; Pearson correlation test; Classification KNN (Training, Testing) and Classification LR (Training, Testing); Result Accuracy]

In the logit function, logit(P(y=1)) is the log-odds value that a sample of data belongs to the positive class, and $\beta_0, \beta_1, \dots, \beta_n$ are the coefficients or model weights that need to be estimated during training.

Table 2. Detailed Results of K-Nearest Neighbor

Table 3. Detailed Results of Logistic Regression

International Journal of Computer and Information System (IJCIS)
Peer Reviewed - International Journal
Vol. 04, Issue 04, October 2023
e-ISSN: 2745-9659
https://ijcis.net/index.php/ijcis/index