6.1 Feature selection

Before fitting any survival model or clustering algorithm to the data, the first step is to select variables that discriminate between clients in terms of churn hazard. The Kaplan-Meier analysis presented in section 5.2 gives a general overview of the features that influence the survival probability. In other words, our feature selection method relies on the results obtained with descriptive statistics.
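As an illustration of this selection logic, the following minimal sketch computes the Kaplan-Meier (product-limit) estimate separately for each level of a candidate feature, so that the resulting curves can be compared. It is not the code used in this study: the records are invented toy values, and the field names tenure and churn, as well as the helper names kaplan_meier and km_by_group, are assumptions made for the example.

```python
from collections import defaultdict


def kaplan_meier(durations, events):
    """Product-limit estimate: list of (t, S(t)) at each distinct event time.

    durations: observed time for each client
    events:    1 if churn was observed at that time, 0 if censored
    """
    data = sorted(zip(durations, events))
    n_at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        churned = 0
        leaving = 0
        # Gather every observation (event or censoring) recorded at time t.
        while i < len(data) and data[i][0] == t:
            churned += data[i][1]
            leaving += 1
            i += 1
        if churned > 0:
            survival *= 1.0 - churned / n_at_risk
            curve.append((t, survival))
        # Clients observed at t are no longer at risk afterwards.
        n_at_risk -= leaving
    return curve


def km_by_group(rows, feature, time_key="tenure", event_key="churn"):
    """Kaplan-Meier curve for each level of a categorical feature."""
    groups = defaultdict(lambda: ([], []))
    for row in rows:
        durations, events = groups[row[feature]]
        durations.append(row[time_key])
        events.append(row[event_key])
    return {level: kaplan_meier(d, e) for level, (d, e) in groups.items()}


# Toy records invented for illustration only (not taken from the data set).
toy = [
    {"Contract": "Month-to-month", "tenure": 2, "churn": 1},
    {"Contract": "Month-to-month", "tenure": 5, "churn": 1},
    {"Contract": "Month-to-month", "tenure": 8, "churn": 0},
    {"Contract": "Month-to-month", "tenure": 12, "churn": 1},
    {"Contract": "Two year", "tenure": 20, "churn": 0},
    {"Contract": "Two year", "tenure": 30, "churn": 1},
    {"Contract": "Two year", "tenure": 48, "churn": 0},
    {"Contract": "Two year", "tenure": 60, "churn": 0},
]

curves = km_by_group(toy, "Contract")
for level, curve in curves.items():
    print(level, [(t, round(s, 3)) for t, s in curve])
```

A feature whose levels produce clearly separated curves is a candidate for selection. A formal comparison such as a log-rank test could complement the visual inspection, but it is not part of this sketch.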

Table 6.1 shows the selected variables for 5 observations drawn at random from the data set. These features relate to account or service information, with two exceptions: Senior_Citizen, and Dependents, which indicates whether the client lives with any dependents (children, parents, etc.). Furthermore, 9 of the 10 selected variables are categorical, which means that the estimation results can be used to compare different groups of clients. Monthly_Charges is the only quantitative variable used to fit the clustering models and survival regressions.

Table 6.1: Explanatory variables used in survival models and cluster analysis
Senior_Citizen  Dependents  Phone_Service  Internet_Service  Online_Security  Online_Backup  Tech_Support  Contract        Payment_Method  Monthly_Charges
Yes             No          Yes            DSL               No               Yes            No            Month-to-month  Bank transfer   50.40
No              Yes         Yes            No                No               No             No            Two year        Mailed check    19.65
No              Yes         Yes            Fiber optic       Yes              Yes            Yes           Two year        Credit card     114.30
No              No          Yes            No                No               No             No            Month-to-month  Bank transfer   19.95
No              No          Yes            DSL               Yes              No             Yes           Month-to-month  Mailed check    54.45
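
Since most of these variables are categorical, they have to be turned into numeric indicators before entering a regression. The sketch below shows one common way to do this, dummy coding with a dropped reference level, applied to three rows and three columns copied from Table 6.1. The encoding scheme, the choice of reference level (the first level in alphabetical order) and the helper name dummy_encode are assumptions made for illustration; this section does not state which encoding is actually used.

```python
# Three rows and three columns copied from Table 6.1, kept short for brevity.
rows = [
    {"Internet_Service": "DSL", "Contract": "Month-to-month", "Monthly_Charges": 50.40},
    {"Internet_Service": "No", "Contract": "Two year", "Monthly_Charges": 19.65},
    {"Internet_Service": "Fiber optic", "Contract": "Two year", "Monthly_Charges": 114.30},
]
categorical = ["Internet_Service", "Contract"]


def dummy_encode(rows, categorical):
    """Replace each categorical column by 0/1 indicators, dropping one level.

    The dropped level (first in alphabetical order, an arbitrary choice)
    serves as the reference group for that variable.
    """
    levels = {col: sorted({row[col] for row in rows}) for col in categorical}
    encoded = []
    for row in rows:
        # Quantitative columns are passed through unchanged.
        out = {k: v for k, v in row.items() if k not in categorical}
        for col in categorical:
            for level in levels[col][1:]:  # skip the reference level
                out[f"{col}={level}"] = 1 if row[col] == level else 0
        encoded.append(out)
    return encoded


encoded = dummy_encode(rows, categorical)
for row in encoded:
    print(row)
```

With this coding, the coefficient estimated for an indicator such as Contract=Two year is read relative to the reference group (here Month-to-month), which is what makes comparisons between groups of clients possible.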