Machine Learning Predictive Models Analysis on Telecommunications Service Churn Rate

Customer churn occurs frequently in the telecommunications industry and can be detrimental to service providers. A predictive model can help determine and analyze the causes of customer churn. This paper analyzes and implements machine learning models to predict churn using Kaggle data on customer churn. The models considered include the XG Boost Classifier, Bernoulli Naïve Bayes, and Decision Tree algorithms. The research covers data preparation, cleaning, and transformation; exploratory data analysis (EDA); prediction model design; and analysis of accuracy, F1 score, receiver operating characteristic (ROC) curve, and area under the ROC curve (AUC) score. The EDA results indicate that contract type, length of tenure, monthly invoice, and total bill are the features that most influence churn. Among the models considered, the XG Boost Classifier achieved the highest accuracy and F1 score, 81.59% and 74.76%, respectively. However, in terms of efficiency, the Bernoulli Naïve Bayes and Decision Tree algorithms outperformed XG Boost, with AUC scores of 0.7469 and 0.7468, respectively.


Introduction
Customer churn refers to the action taken when a customer stops using a service or accessing content from a service provider. Essentially, it involves breaking a contract with a company. This action can have detrimental effects on service companies that strive to compete and satisfy their customers. Identifying potential churn actions is crucial for maintaining good customer service and ensuring the revenue of a service provider. Several factors can contribute to churn actions, including customer dissatisfaction resulting from poor service experiences, problematic products, lack of communication platforms, and the absence of customer loyalty programs. On average, the telecommunications industry experiences an annual churn rate of 20-40%. To maintain revenue, it is more cost-effective for companies to focus on retaining existing customers rather than acquiring new ones, as the latter can be five to ten times more expensive.
Churn rate modeling and analysis play a vital role in the telecommunications industry [1][2]. Creating a predictive churn model involves multiple steps, such as data collection, understanding, pre-processing, learning, model design, development, validation, and evaluation [3,4]. The effective use of training and testing data simplifies the process and helps ensure the accuracy and effectiveness of the designed model. However, because churn data is unbalanced, with most customers belonging to the non-churn class, traditional machine learning methods struggle to achieve accurate classification rates. This challenge highlights the importance of customer relationship management and marketing processes in minimizing churn amid intense competition and rapid developments in telecommunications services. Customer churn can be categorized into two types: voluntary and involuntary. Voluntary churn occurs when customers choose to terminate their contracts, while involuntary churn is a consequence of delayed billing leading to the termination of a subscription. Of these two types, companies often face greater difficulty in retaining customers who churn voluntarily. Understanding the reasons behind contract terminations and identifying service deficiencies can be challenging for companies. Figure 1 illustrates the classification of customer churn into various groups [4]. Voluntary churn can be further divided into two sub-categories: incidental churn and deliberate churn. Incidental churn occurs when changes in a customer's life circumstances force them to terminate a contract with a company. Deliberate churn, on the other hand, happens when customers choose to switch to competitors or adopt new technology that offers high-quality services at competitive prices, catering to their needs [5].
Technological advancements, price competition, influence from friends and family, and experiments conducted by many individuals are among the many factors contributing to churn. Intense competition often leads to customers switching services from one company to another [6]. Many studies have explored churn prediction models using various classification methods. The accuracy value often serves as the primary parameter for evaluating the performance of these models. However, it is crucial to consider the specific data and features being analyzed, as well as the necessary data processing and information related to the available features and modeling techniques. Machine learning methods offer predictive models that can be adapted to different datasets [7][8][9][10][11][12][13]. Many of these studies employ accurate meta-heuristic algorithms. Some researchers have focused on improving sample data through more efficient pre-processing methods, such as incorporating social features in data extraction and selecting appropriate algorithms [14].
Several studies have examined predictive models using comparable datasets [3,7,8,15-29]. These studies indicate that the SVM algorithm often achieves the highest accuracy value. Another approach involves dimension reduction through feature selection before implementing classifiers, and some researchers have employed stratified splitting to improve training results on the classifier [30,31]. These approaches have resulted in increased prediction accuracy and improved performance efficiency. Despite the numerous studies conducted in the past five years and beyond, there is still ample room for improvement in every machine learning algorithm used for predicting customer churn across various industries. Given the wide range of machine learning algorithms available, this research compares common algorithms as a starting point, providing an opportunity to consider various methods. Each algorithm possesses unique characteristics in terms of its prediction system, and this study aims to simplify the available options for predicting customer churn and for further development. The dataset used consists primarily of categorical and numerical features. The analysis of the dataset employs common machine learning algorithms, as shown in Table 1.
Table 1. Comparison of the 10 classifier models used in this research.

Model
Advantages of Common Classifier Models

Logistic Regression
Uses a statistical approach to analyze one or more independent variables and find the best-fitting model that describes the relationship between variables [32].

Random Forest
A supervised machine learning model that solves two-group classification issues using classification techniques. An SVM model can classify incoming text after receiving sets of labeled training data for each category [24].

Support Vector Machine
A supervised machine learning model that solves two-group classification issues using classification techniques. An SVM model can classify incoming text after receiving sets of labeled training data for each category [33,34].

KNN
Performs discriminant analysis when reliable parametric estimates of probability densities are unknown or difficult to determine. It is one of the most fundamental classification methods and is simple to use [35].

Decision Tree
Its most important feature is the capability of capturing descriptive decision-making knowledge from the supplied data, owing to its ability to use different feature subsets and decision rules at different stages of classification [36,37].

Bernoulli Naive Bayes
It is based on probability models that incorporate strong independence assumptions [32].

Discriminant Analysis
A reliable classification technique that supports dimension reduction whether or not data normality is assumed. It has closed-form solutions that are simple to compute, is naturally multiclass, has a good track record in practice, and does not require tuning of any hyperparameters [33].

ADA Boost
An ensemble learning technique, called "meta-learning", that was first developed to improve the effectiveness of binary classifiers. AdaBoost employs an iterative methodology to improve weak classifiers by learning from their errors and replacing them with stronger ones [35].

Gradient Boost
Known as ADA Boost with Weighted Minimization, this technique can reduce loss, that is, the discrepancy between the training example's actual class value and the predicted class value. Although understanding the method for decreasing the classifier's loss is not necessary, it works similarly to gradient descent in a neural network [38].

XG Boost
A fast and accurate solution to a variety of data science issues can be found in the parallel tree boosting technique XGBoost (based on Gradient Boosting), also called GBDT or GBM. The same algorithm operates in distributed environments such as Hadoop, SGE, and MPI and can solve problems beyond thousands of examples [39].

This research focuses on churn rate prediction modeling using a dataset obtained from Kaggle. The aim is to compare and analyze different algorithms to gain insights and contribute to the identification of customer churn. By designing predictive models and implementing various classification techniques, this research aims to provide valuable findings. The algorithms considered for comparison include Logistic Regression, Random Forest Classifier, Support Vector Machine (SVM), Bernoulli Naive Bayes Classifier, K-Nearest Neighbor Classifier, Decision Tree Classifier, ADA Boost Classifier, Gradient Boost Classifier, and XG Boost Classifier. The effectiveness and accuracy of these models are evaluated using parameters such as accuracy value, average F1 score, ROC curve, and AUC score. These metrics help determine the performance of each model in predicting customer churn. Overall, this research describes predictive modeling techniques specifically tailored to address customer churn issues in the telecommunications sector.

Materials and Methods
Figure 2 illustrates the methodology employed in this research. Prior to analyzing and comparing machine learning algorithms using the dataset, several key stages are essential for feature clarification and further processing. The initial stage involves identifying the dataset obtained from open-source data, specifically focusing on telecommunications service providers. The subsequent stage is initialization, which includes identifying dataset features, data types, and correlations; addressing empty or invalid data; transforming column values into numeric values; and performing data normalization [40,41].
After the data preparation stage, various machine learning models are implemented. The data is divided into training and testing sets. The performance of the models is evaluated using the confusion matrix, which provides the information needed for parameters such as the F1 score and model accuracy. In this research, the accuracy and F1 score equations described in Equation (1) and Equation (2) are considered [26]. The confusion matrix consists of four assessment sections: True Positives (TP), False Positives (FP), True Negatives (TN), and False Negatives (FN). The definitions of each part of the confusion matrix are as follows:
• True positives (TP): the number of actual churns that are correctly predicted as churns.
• True negatives (TN): the number of actual non-churns that are correctly predicted as non-churns.
• False positives (FP): the number of actual non-churns that are incorrectly predicted as churns.
• False negatives (FN): the number of actual churns that are incorrectly predicted as non-churns.
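The four counts above are enough to compute both evaluation parameters. The following is a minimal sketch, assuming that Equations (1) and (2) follow the standard definitions of accuracy and F1 score; the function names and the example counts are hypothetical and are not results from this research.

```python
def accuracy(tp, fp, tn, fn):
    """Fraction of all predictions that are correct: (TP + TN) / total."""
    return (tp + tn) / (tp + fp + tn + fn)


def f1_score(tp, fp, fn):
    """Harmonic mean of precision and recall for the churn (positive) class."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical confusion-matrix counts, for illustration only.
tp, fp, tn, fn = 60, 20, 100, 20
print(accuracy(tp, fp, tn, fn))  # share of correct predictions
print(f1_score(tp, fp, fn))      # F1 score of the churn class
```

Because the F1 score ignores true negatives, it is less flattered than accuracy by a large non-churn majority.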
The efficiency of the designed model can be assessed using the receiver operating characteristic (ROC) curve and the area under the ROC curve (AUC). The ROC curve illustrates the performance of a classification model across different classification thresholds by plotting the true positive rate (TPR) against the false positive rate (FPR) at each threshold. The AUC score, in turn, quantifies the two-dimensional area under the entire ROC curve and therefore provides a comprehensive measure of performance across all possible classification thresholds. The higher the AUC score, the better the model is at differentiating between positive and negative classes. A perfect AUC score of '1' indicates that the classifier can accurately distinguish between all 'positive' and 'negative' class points. Conversely, if the classifier predicts all negatives as positive and vice versa, the AUC score is '0'.
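The relationship between thresholds, TPR, FPR, and the AUC score can be made concrete with a short sketch. It assumes binary labels (1 = churn) and one predicted score per customer, and it integrates the curve with the trapezoidal rule; the function names and the toy data are hypothetical.

```python
def roc_points(y_true, scores):
    """Return (FPR, TPR) points obtained by lowering the threshold over every distinct score."""
    pos = sum(y_true)
    neg = len(y_true) - pos
    points = [(0.0, 0.0)]  # threshold above every score: nothing predicted positive
    for thr in sorted(set(scores), reverse=True):
        tp = sum(1 for y, s in zip(y_true, scores) if s >= thr and y == 1)
        fp = sum(1 for y, s in zip(y_true, scores) if s >= thr and y == 0)
        points.append((fp / neg, tp / pos))
    return points


def auc(points):
    """Area under the ROC curve, computed with the trapezoidal rule."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area


# Toy labels and scores, for illustration only.
y_true = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
print(auc(roc_points(y_true, scores)))
```

In this toy example, eight of the nine churner/non-churner pairs are ranked correctly, which matches the computed area of 8/9.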

Results and Discussion
The dataset utilized in this research is sourced from open-source data available on Kaggle, specifically the IBM sample dataset, which contains information on customer data and program retention [41]. The dataset encompasses personal information of customers, services associated with the utilized program, churn actions performed by customers, and customer demographics. In total, the dataset comprises 21 features, with a recorded count of 7,044 customers. Figure 4 illustrates the distribution of the various services utilized by customers.
After exploring the personal data, the identified points are as follows:
• 50.5% of customers are male, and 49.5% are female.
• 16.2% are registered as elderly residents.
• 48.3% of customers have spouses, while around 30% have dependents.
Further identified information is as follows:
• 60% of customers choose paperless billing.
• 33.6% of all customers use electronic checks as a payment method.
• 55% of customers choose a month-to-month contract when subscribing to services. The rest have 1-year or 2-year contracts.
• The largest group of subscribers (around 800 customers) is in the first month of tenure, while another group (around 500 customers) has subscribed for approximately 72 months.
• Customers on a 2-year contract tend to stay subscribed longer (around 72 months), while customers on a month-to-month contract tend to subscribe for approximately 1-2 months.
Figure 5 illustrates the distribution of customer churn, with 73.4% of customers not churning. To balance the number of customers who churn and those who do not, alternative methods are necessary. One approach is the use of stratified sampling/splitting, or multilevel data separation, to prevent an excessive number of false negatives, which can lead to lower accuracy in predicting churn actions [30,31,42,43]. Stratified sampling helps achieve a balanced dataset and reduces skewness. Another method is cross-validation, which provides a stable dataset and reliable estimates of model performance. It can also be used to compare different models and training algorithms, as well as to determine optimal model parameters [44]. For example, the data can be balanced by randomly sampling non-churners so that their number matches the number of churners before splitting the data into training and testing sets for each classifier.
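The two ideas in the paragraph above, a split that preserves the class proportions and random undersampling of non-churners, can be sketched as follows. This is an illustration using only the Python standard library, not the procedure used in this research; the function names are hypothetical, and the labels are synthetic values that merely mirror the reported 73.4% non-churn share.

```python
import random


def _indices_by_class(labels):
    """Group row indices by class label."""
    groups = {}
    for i, y in enumerate(labels):
        groups.setdefault(y, []).append(i)
    return groups


def stratified_split(labels, test_fraction=0.2, seed=0):
    """Return (train_idx, test_idx) with the same class proportions in both parts."""
    rng = random.Random(seed)
    train_idx, test_idx = [], []
    for indices in _indices_by_class(labels).values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_fraction)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)


def undersample_majority(labels, seed=0):
    """Return indices in which every class is randomly reduced to the minority-class size."""
    rng = random.Random(seed)
    groups = _indices_by_class(labels)
    n_min = min(len(indices) for indices in groups.values())
    kept = []
    for indices in groups.values():
        kept.extend(rng.sample(indices, n_min))
    return sorted(kept)


# Synthetic labels (0 = no churn, 1 = churn), for illustration only.
labels = [0] * 734 + [1] * 266
train_idx, test_idx = stratified_split(labels)
balanced_idx = undersample_majority(labels)
print(len(train_idx), len(test_idx), len(balanced_idx))
```

The stratified split keeps the original imbalance in both parts, whereas undersampling changes the class ratio itself, so the two steps address different problems.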
Note that poor data splitting can result in inaccurate and highly variable model performance [45,46]. The following findings were obtained from exploring the features and their relationship to the churn rate:
1. Many subscribers who use fiber optic internet service have churned. Customers using DSL, on the other hand, churn less frequently.
2. Customers without internet service have very low churn rates.
3. Senior citizens have nearly double the churn rate of younger customers.
4. The longer a customer has stayed with the service provider, the less likely the customer is to churn.
5. The churn rate is higher when the total cost or bill is lower.
6. A higher percentage of customers churn when their monthly fees or bills are high.
After identification, exploration, observation, and analysis, the models were applied to determine the accuracy value and F1 score. The data was divided into 80% for training and 20% for testing. Table 2 presents the results for each model, indicating that XG Boost has the highest F1 score and accuracy, with values of 74.76% and 81.59%, respectively. This outcome can be attributed to its slow learning and its implementation of parallel trees, which scan gradient values and use the partial sums of these gradient values to evaluate the quality of every possible data split. Other results demonstrate that the accuracy value and F1 score are comparable across all models. This suggests that the data has not yet reached optimal conditions. It is possible that the data is still skewed and lacks a proper distribution for creating more accurate predictive models. Optimal data quality encompasses factors such as accuracy, completeness, consistency, integrity, duplication, and timeliness [47], and a good dataset should take these metrics into account. The dataset used is limited by the uneven sizes of the churner and non-churner classes, which leads to skewness and potential outliers. These outliers could result from data entry errors, measurement errors, or natural occurrences. Further studies can be conducted to identify and minimize outliers in this dataset, thus optimizing the performance of machine learning algorithms. The dataset employed in this study has not been previously researched by other scholars, so its use here represents a unique contribution. Implementing various machine learning algorithms reveals that customers with a 2-year contract with the company are less likely to churn. Figure 6 illustrates the feature correlation in model design and analysis using the logistic regression algorithm.
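The split evaluation attributed to XG Boost above, scanning gradient values and using their partial sums to score every candidate split, can be sketched for a single feature. This is an illustrative re-implementation of the split-gain formula from the published XGBoost formulation, not the library's own code; the function name, the parameters `reg_lambda` and `gamma`, and the toy tenure data are assumptions made for this example.

```python
def best_split(feature, grad, hess, reg_lambda=1.0, gamma=0.0):
    """Scan one feature in sorted order and return (best_gain, threshold)."""
    order = sorted(range(len(feature)), key=lambda i: feature[i])
    g_total, h_total = sum(grad), sum(hess)

    def score(g, h):
        # Quality of a leaf that holds gradient sum g and hessian sum h.
        return g * g / (h + reg_lambda)

    best_gain, best_thr = 0.0, None
    g_left = h_left = 0.0
    for pos in range(len(order) - 1):
        i, nxt = order[pos], order[pos + 1]
        g_left += grad[i]  # running (partial) sums for the left child
        h_left += hess[i]
        if feature[i] == feature[nxt]:
            continue  # no threshold can separate equal values
        g_right, h_right = g_total - g_left, h_total - h_left
        gain = 0.5 * (score(g_left, h_left) + score(g_right, h_right)
                      - score(g_total, h_total)) - gamma
        if gain > best_gain:
            best_gain = gain
            best_thr = (feature[i] + feature[nxt]) / 2
    return best_gain, best_thr


# Toy example: tenure in months and churn labels, starting from a prediction of p = 0.5.
tenure = [1, 2, 3, 60, 70, 72]
churn = [1, 1, 1, 0, 0, 0]
p = 0.5
grad = [p - y for y in churn]      # logistic-loss gradient: p - y
hess = [p * (1 - p) for _ in churn]  # logistic-loss hessian: p * (1 - p)
print(best_split(tenure, grad, hess))
```

Because the left-child sums are accumulated during a single pass over the sorted values, every candidate threshold is scored without recomputing sums from scratch.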
Variables such as total billing, monthly contracts, internet services via fiber optic cables, and the senior status of customers have a significant impact on churn. Figure 7 displays the feature importance in the dataset according to the XG Boost algorithm. It highlights the significance of monthly and total bills in determining customer churn actions. The customer service features, by contrast, tend to be of minor importance to churn.
Figure 7. Feature importance in the dataset based on the XG Boost algorithm.
Figure 8 illustrates the ROC curve and AUC score obtained from the implementation of machine learning algorithms in predictive models for churn actions. The ROC curve is a probability curve reflecting the performance of a classifier model at different thresholds, and the AUC score represents the degree of separability between classes, that is, how well the model can differentiate between them [48]. A higher AUC indicates that the model is more accurate at classifying the '0' classes as '0' and the '1' classes as '1'. For example, the higher the AUC, the better the model is at differentiating between true and false boolean answers. Based on the results of the ROC curve and AUC score, the Bernoulli Naïve Bayes and Decision Tree algorithms show the highest efficiency, with nearly identical values. Although XG Boost has a higher F1 score and accuracy value than Decision Tree and Bernoulli Naïve Bayes, it scores relatively lower than these two models in terms of predictive model efficiency.

Conclusions
The type of contract, tenure, monthly bill, and total bill are the features that have the most significant influence on customer churn actions. Among the customer services, fiber optic cable service has the highest impact on churn, while DSL service has a minimal effect. Customer attributes such as gender and seniority status do not significantly affect churn tendencies. The results indicate that the XG Boost Classifier algorithm performs the best, achieving an accuracy value of 81.59% and an F1 Score of 74.76%. In terms of efficiency, the Bernoulli Naïve Bayes and Decision Tree algorithms show AUC scores of 0.7469 and 0.7468, respectively. The ROC curve for these models is better than that of XG Boost, despite XG Boost having the highest accuracy and F1 score.