6.1 Feature selection

Before fitting any survival model or clustering algorithm to the data, the first step is to select variables that discriminate between clients in terms of churn hazard. The Kaplan-Meier analysis presented in section 5.2 gives a general overview of the features that influence the survival probability. In other words, our feature selection method relies on the results obtained with descriptive statistics.
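To illustrate what "discriminating in terms of churn hazard" means in practice, the following is a minimal sketch of a Kaplan-Meier comparison between two groups of a candidate variable. It is not the code used in section 5.2: the function names, the `tenure` and `churned` fields, and the eight toy records are assumptions made up for illustration and are not taken from the data set.

```python
from collections import defaultdict


def kaplan_meier(durations, events):
    """Return the Kaplan-Meier curve as a list of (time, survival) pairs.

    durations: observed time for each client (e.g. tenure in months)
    events:    1 if the client churned at that time, 0 if censored
    """
    churned_at = defaultdict(int)   # number of churn events at each time
    leaving_at = defaultdict(int)   # churned + censored clients at each time
    for t, e in zip(durations, events):
        leaving_at[t] += 1
        if e:
            churned_at[t] += 1

    at_risk = len(durations)
    survival = 1.0
    curve = []
    for t in sorted(leaving_at):
        d = churned_at[t]
        if d:
            # Product-limit update: S(t) = S(t-) * (1 - d / n)
            survival *= 1.0 - d / at_risk
            curve.append((t, survival))
        at_risk -= leaving_at[t]
    return curve


def survival_at(curve, t):
    """Evaluate the step function described by `curve` at time t."""
    s = 1.0
    for time, surv in curve:
        if time > t:
            break
        s = surv
    return s


# Toy records (hypothetical): (tenure in months, churned flag, Contract level)
records = [
    (2, 1, "Month-to-month"),
    (5, 1, "Month-to-month"),
    (8, 0, "Month-to-month"),
    (12, 1, "Month-to-month"),
    (24, 0, "Two year"),
    (30, 1, "Two year"),
    (48, 0, "Two year"),
    (60, 0, "Two year"),
]

# Fit one curve per level of the candidate variable.
groups = defaultdict(lambda: ([], []))
for tenure, churned, contract in records:
    groups[contract][0].append(tenure)
    groups[contract][1].append(churned)

curves = {level: kaplan_meier(d, e) for level, (d, e) in groups.items()}

for level, curve in curves.items():
    print(level, "S(6 months) =", round(survival_at(curve, 6), 3))
```

A variable is a reasonable candidate when the curves of its levels are clearly separated over time. A formal comparison, such as a log-rank test, would be needed to confirm that the difference is not due to chance.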

Table 6.1 shows the selected variables for 5 random observations extracted from the data set. Most of these features relate to account or service information. The exceptions are Senior_Citizen and Dependents, the latter indicating whether the client lives with any dependents (children, parents, etc.). Furthermore, 9 of the 10 selected variables are categorical, which implies that the estimation results can be used to compare different groups of clients. Monthly_Charges is the only quantitative variable used to fit the clustering models and survival regressions.

Table 6.1: Explanatory variables used in survival models and cluster analysis
Senior_Citizen	Dependents	Phone_Service	Internet_Service	Online_Security	Online_Backup	Tech_Support	Contract	Payment_Method	Monthly_Charges
Yes	No	Yes	DSL	No	Yes	No	Month-to-month	Bank transfer	50.40
No	Yes	Yes	No	No	No	No	Two year	Mailed check	19.65
No	Yes	Yes	Fiber optic	Yes	Yes	Yes	Two year	Credit card	114.30
No	No	Yes	No	No	No	No	Month-to-month	Bank transfer	19.95
No	No	Yes	DSL	Yes	No	Yes	Month-to-month	Mailed check	54.45
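The split between categorical and quantitative variables described above can be checked directly on the five rows of Table 6.1. The sketch below is only illustrative: the helper names `is_number` and `split_by_type` are hypothetical, and the levels it lists are those visible in this five-row sample, so the full data set may contain additional levels.

```python
# Column names and rows copied from Table 6.1.
COLUMNS = [
    "Senior_Citizen", "Dependents", "Phone_Service", "Internet_Service",
    "Online_Security", "Online_Backup", "Tech_Support", "Contract",
    "Payment_Method", "Monthly_Charges",
]
ROWS = [
    ("Yes", "No", "Yes", "DSL", "No", "Yes", "No",
     "Month-to-month", "Bank transfer", "50.40"),
    ("No", "Yes", "Yes", "No", "No", "No", "No",
     "Two year", "Mailed check", "19.65"),
    ("No", "Yes", "Yes", "Fiber optic", "Yes", "Yes", "Yes",
     "Two year", "Credit card", "114.30"),
    ("No", "No", "Yes", "No", "No", "No", "No",
     "Month-to-month", "Bank transfer", "19.95"),
    ("No", "No", "Yes", "DSL", "Yes", "No", "Yes",
     "Month-to-month", "Mailed check", "54.45"),
]


def is_number(text):
    """Return True if the cell can be parsed as a float."""
    try:
        float(text)
        return True
    except ValueError:
        return False


def split_by_type(columns, rows):
    """Separate column names into categorical and quantitative lists."""
    categorical, quantitative = [], []
    for j, name in enumerate(columns):
        values = [row[j] for row in rows]
        if all(is_number(v) for v in values):
            quantitative.append(name)
        else:
            categorical.append(name)
    return categorical, quantitative


categorical, quantitative = split_by_type(COLUMNS, ROWS)

# Levels observed in the sample for each categorical variable.
levels = {
    name: sorted({row[COLUMNS.index(name)] for row in ROWS})
    for name in categorical
}

print(len(categorical), "categorical variables:", categorical)
print(len(quantitative), "quantitative variable:", quantitative)
for name, values in levels.items():
    print(f"{name}: {values}")
```

Listing the observed levels is a quick check before fitting, since each level of a categorical variable defines one of the client groups that the estimation results compare.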