6.1 Feature selection
Before fitting any survival model or clustering algorithm to the data, the first step is to select variables that discriminate between clients in terms of churn hazard. The Kaplan-Meier analysis presented in section 5.2 gives a general overview of the features that influence the survival probability. In other words, our feature selection method relies on results obtained with descriptive statistics.
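
To illustrate the kind of comparison behind this selection, the following is a minimal sketch of a Kaplan-Meier estimate computed separately for each level of a categorical feature. It is not the code used in this study: the function names `kaplan_meier` and `km_by_group` and the toy records are hypothetical, written in plain Python for illustration only. A feature whose groups show clearly separated survival curves would be a candidate for selection.

```python
from collections import defaultdict


def kaplan_meier(durations, events):
    """Return [(t, S(t))] at each distinct time where a churn event occurs.

    durations: observed tenure of each client
    events:    1 if the client churned at that time, 0 if censored
    """
    churned = defaultdict(int)   # number of churn events at each time
    leaving = defaultdict(int)   # clients leaving the risk set (event or censored)
    for t, e in zip(durations, events):
        leaving[t] += 1
        if e:
            churned[t] += 1

    at_risk = len(durations)
    survival = 1.0
    curve = []
    for t in sorted(leaving):
        d = churned.get(t, 0)
        if d:
            # Product-limit update: S(t) = S(t-) * (1 - d / n)
            survival *= 1.0 - d / at_risk
            curve.append((t, survival))
        at_risk -= leaving[t]
    return curve


def km_by_group(rows):
    """Estimate one survival curve per group label."""
    groups = defaultdict(lambda: ([], []))
    for duration, event, label in rows:
        groups[label][0].append(duration)
        groups[label][1].append(event)
    return {label: kaplan_meier(d, e) for label, (d, e) in groups.items()}


# Hypothetical toy records: (tenure in months, churned?, Contract level).
toy = [
    (2, 1, "Month-to-month"),
    (5, 1, "Month-to-month"),
    (8, 0, "Month-to-month"),
    (12, 1, "Month-to-month"),
    (24, 0, "Two year"),
    (36, 1, "Two year"),
    (48, 0, "Two year"),
    (60, 0, "Two year"),
]

curves = km_by_group(toy)
for label, curve in curves.items():
    print(label, [(t, round(s, 3)) for t, s in curve])
```

In practice, a visual comparison of this kind is usually complemented by a formal test such as the log-rank test before a feature is retained.
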
Table 6.1 shows the selected variables for 5 random observations extracted from the data set. These features relate to account or service information, apart from Dependents, which indicates whether the client lives with any dependents (children, parents, etc.), and Senior_Citizen. Furthermore, 9 of the 10 selected variables are categorical, which implies that the estimation results can be used to compare different groups of clients. Monthly_Charges is the only quantitative variable used to fit clustering models and survival regressions.
Senior_Citizen | Dependents | Phone_Service | Internet_Service | Online_Security | Online_Backup | Tech_Support | Contract | Payment_Method | Monthly_Charges |
---|---|---|---|---|---|---|---|---|---|
Yes | No | Yes | DSL | No | Yes | No | Month-to-month | Bank transfer | 50.40 |
No | Yes | Yes | No | No | No | No | Two year | Mailed check | 19.65 |
No | Yes | Yes | Fiber optic | Yes | Yes | Yes | Two year | Credit card | 114.30 |
No | No | Yes | No | No | No | No | Month-to-month | Bank transfer | 19.95 |
No | No | Yes | DSL | Yes | No | Yes | Month-to-month | Mailed check | 54.45 |
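
Since most of the selected variables are categorical, they generally have to be turned into numeric indicators before a regression can be fitted. The sketch below shows one common way to do this, dummy coding with the first level of each variable as reference, applied to the five rows of Table 6.1. It is an assumption for illustration, not necessarily the encoding used in this study; the helper names `build_levels` and `encode` are hypothetical, and the levels are inferred from these five rows only, so the full data set may contain more.

```python
HEADER = [
    "Senior_Citizen", "Dependents", "Phone_Service", "Internet_Service",
    "Online_Security", "Online_Backup", "Tech_Support", "Contract",
    "Payment_Method", "Monthly_Charges",
]
NUMERIC = {"Monthly_Charges"}  # the only quantitative variable

# The five observations shown in Table 6.1.
RAW = [
    ("Yes", "No", "Yes", "DSL", "No", "Yes", "No", "Month-to-month", "Bank transfer", 50.40),
    ("No", "Yes", "Yes", "No", "No", "No", "No", "Two year", "Mailed check", 19.65),
    ("No", "Yes", "Yes", "Fiber optic", "Yes", "Yes", "Yes", "Two year", "Credit card", 114.30),
    ("No", "No", "Yes", "No", "No", "No", "No", "Month-to-month", "Bank transfer", 19.95),
    ("No", "No", "Yes", "DSL", "Yes", "No", "Yes", "Month-to-month", "Mailed check", 54.45),
]
rows = [dict(zip(HEADER, values)) for values in RAW]


def build_levels(records):
    """Collect the sorted distinct levels of every categorical variable."""
    levels = {}
    for record in records:
        for name, value in record.items():
            if name not in NUMERIC:
                levels.setdefault(name, set()).add(value)
    return {name: sorted(values) for name, values in levels.items()}


def encode(record, levels):
    """Dummy-code one record, using the first sorted level as reference."""
    encoded = {}
    for name, value in record.items():
        if name in NUMERIC:
            encoded[name] = float(value)
        else:
            # Skipping the reference level avoids perfectly collinear columns.
            for level in levels[name][1:]:
                encoded[f"{name}={level}"] = 1.0 if value == level else 0.0
    return encoded


levels = build_levels(rows)
encoded = [encode(record, levels) for record in rows]
print(sorted(encoded[0]))
```

Note that Phone_Service takes a single value in these five rows, so it produces no indicator column here; on the full data set it would contribute one column per non-reference level.
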