Cardiovascular diseases (CVDs) have long been one of the leading causes of death globally. New technologies such as machine learning (ML) algorithms can help with the early detection and prevention of CVDs. This study focuses on using different ML models to estimate a person's risk of developing CVDs from their personal lifestyle factors. The study extracted and processed 438,693 records from the 2021 Behavioral Risk Factor Surveillance System (BRFSS) of the U.S. Centers for Disease Control and Prevention (CDC). The data were partitioned into training and testing sets at a ratio of 0.8:0.2 so that unseen data remained available to evaluate the trained model. One problem encountered was imbalance among the classes, which was addressed with sampling techniques that balance the data before model training. The ML models were evaluated using stratified 10-fold cross-validation, and the best model was Logistic Regression (LR), with an F1 score of 0.32564. The LR model was then subjected to hyperparameter tuning and achieved its best score of 0.3257 with C = 0.1. Feature importance was also derived from the LR model; the features with the most impact were sex, diabetes, and the general health of an individual. The final LR model was then evaluated on the testing data and obtained an F1 score of 0.33. A confusion matrix was used to better visualize the performance: the LR model correctly classified 79.18% of people with CVDs and 73.46% of healthy people. The ROC curve was also used as a performance measure, and the LR model obtained an AUC of 0.837. The LR model can be used in the medical field, and its usefulness could be extended by adding medical attributes to the data.
Overall, this study provides insight that can help in predicting the risk of CVDs using only the personal attributes of an individual.
Keywords: logistic regression, cardiovascular diseases, hyperparameter tuning, imbalanced classification, machine learning algorithms
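
As a reading aid for the metrics reported above, the following minimal sketch shows how a model can correctly classify 79.18% of CVD cases and 73.46% of healthy people and still have an F1 score near 0.33 when the positive class is rare. The function name `precision_and_f1` is hypothetical, and the 8% prevalence is an assumption chosen for illustration; the abstract does not state the class ratio of the test set.

```python
def precision_and_f1(recall, specificity, prevalence):
    """Derive precision and F1 from recall, specificity, and the
    fraction of positive cases (prevalence) in the evaluated data."""
    # Expected share of all records that are true positives.
    tp = recall * prevalence
    # Expected share of all records that are false positives.
    fp = (1 - specificity) * (1 - prevalence)
    precision = tp / (tp + fp)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, f1


# Per-class rates reported in the abstract.
RECALL = 0.7918        # share of people with CVDs correctly classified
SPECIFICITY = 0.7346   # share of healthy people correctly classified

# ASSUMPTION: illustrative prevalence, not a figure from the study.
ASSUMED_PREVALENCE = 0.08

precision, f1 = precision_and_f1(RECALL, SPECIFICITY, ASSUMED_PREVALENCE)
print(f"imbalanced: precision = {precision:.3f}, F1 = {f1:.3f}")

# Same per-class rates on a hypothetical perfectly balanced test set.
balanced_precision, balanced_f1 = precision_and_f1(RECALL, SPECIFICITY, 0.5)
print(f"balanced:   precision = {balanced_precision:.3f}, F1 = {balanced_f1:.3f}")
```

Under this assumed prevalence the derived F1 comes out close to the reported 0.33, because false positives drawn from the much larger healthy group pull precision down. With the same per-class rates on balanced data, F1 would be far higher, which is why F1 and the per-class rates should be read together.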