3.4 Nonparametric models
When dealing with duration data, nonparametric methods give a general overview of the raw (or unconditional) hazard. They are used to describe the data rather than to predict. These models include no explanatory variables, except for treatment variables such as the type of contract a customer has subscribed to.
3.4.1 Notations
Let us consider a sample of \(N\) observations with \(k\) ordered discrete failure times \(t_1 < \dots < t_k\) (e.g. a failure can be a churn event). For all \(j \in [\![1; k]\!]\), we denote by:
- \(t_j\) the \(j^{\text{th}}\) discrete failure time,
- \(d_j\) the number of spells terminating at \(t_j\),
- \(m_j\) the number of right-censored spells in the interval \([t_j, t_{j+1})\), with the convention \(t_{k+1} = +\infty\),
- \(r_j\) the number of exposed durations right before time \(t_j\) i.e. at time \(t_j^{-}\), such that: \[\begin{equation} r_j = (d_j + m_j) + \dots + (d_k + m_k) = \sum_{l|l \geq j} (d_l + m_l) \tag{3.6} \end{equation}\]
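As a minimal sketch of this notation, the Python function below builds \(t_j\), \(d_j\), \(m_j\) and \(r_j\) from a list of observed durations and event indicators. The function name `risk_table`, the input layout and the toy data are illustrative assumptions, not part of the original material. A censored spell is assigned to the half-open interval \([t_j, t_{j+1})\), with \(t_{k+1}\) taken as \(+\infty\).

```python
from collections import Counter


def risk_table(durations, events):
    """Return one row (t_j, d_j, m_j, r_j) per distinct failure time.

    durations: observed spell lengths (hypothetical toy input).
    events: 1 if the spell ended in a failure, 0 if it is right-censored.
    """
    pairs = list(zip(durations, events))
    failures = Counter(t for t, e in pairs if e)
    failure_times = sorted(failures)
    rows = []
    for j, t_j in enumerate(failure_times):
        # Upper bound of the interval [t_j, t_{j+1}); infinite for the last one.
        if j + 1 < len(failure_times):
            upper = failure_times[j + 1]
        else:
            upper = float("inf")
        d_j = failures[t_j]
        m_j = sum(1 for t, e in pairs if not e and t_j <= t < upper)
        # Spells still exposed just before t_j.
        r_j = sum(1 for t, _ in pairs if t >= t_j)
        rows.append((t_j, d_j, m_j, r_j))
    return rows


durations = [2, 3, 3, 5, 6, 6, 8, 9]
events = [1, 1, 0, 1, 0, 1, 1, 0]
table = risk_table(durations, events)
for row in table:
    print(row)
```

On this toy sample, each \(r_j\) counted directly from the data coincides with the sum in equation (3.6), which is a quick sanity check on the bookkeeping.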
3.4.2 Hazard function estimator
As the hazard at time \(t_j\) is defined as \(\lambda_j = P[T=t_j|T\geq t_j]\), a natural estimator of \(\lambda_j\) is obtained by dividing the number of spells for which the event occurs at \(t_j\) by the total number of spells exposed at time \(t_j^{-}\). Formally, it is expressed as:
\[\begin{equation} \hat{\lambda}_j = \frac{d_j}{r_j} \tag{3.7} \end{equation}\]
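To make equation (3.7) concrete, here is a small Python sketch that computes the hazard estimates from the counts \(d_j\) and \(r_j\). The function name `hazard_estimates` and the numbers are hypothetical and chosen only for illustration; exact fractions are used so that the ratios are easy to read.

```python
from fractions import Fraction


def hazard_estimates(d, r):
    """Return the estimates d_j / r_j of equation (3.7), one per failure time."""
    return [Fraction(d_j, r_j) for d_j, r_j in zip(d, r)]


# Hypothetical counts: failures and spells at risk at five failure times.
d = [1, 1, 1, 1, 1]
r = [8, 7, 5, 4, 2]
hazards = hazard_estimates(d, r)
print([str(h) for h in hazards])
```

Note that the estimates become noisier at late failure times, where few spells remain at risk.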
3.4.3 Kaplan-Meier estimator
Once the hazard function estimator has been computed, the discrete-time survivor function can be estimated using the Kaplan-Meier product-limit estimator. To estimate survival at time \(t\), this estimator computes the probability that a spell stays in the same state until \(t\) (e.g. a customer remaining loyal to a firm until a certain time) as the product of the conditional probabilities of surviving each failure time up to \(t\). The survival function estimate is defined as:
\[\begin{equation} \hat{S}(t) = \prod_{j|t_j \leq t} \big(1-\hat{\lambda}_j\big) = \prod_{j|t_j \leq t}\frac{r_j - d_j}{r_j} \tag{3.8} \end{equation}\]
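The product in equation (3.8) can be sketched in a few lines of Python. The function name `kaplan_meier` and the counts below are assumptions made for illustration; the function returns the estimate at each failure time, and the survivor function is constant between two consecutive failure times.

```python
from fractions import Fraction


def kaplan_meier(d, r):
    """Return the product-limit estimates S(t_1), ..., S(t_k) of equation (3.8)."""
    survival = []
    s = Fraction(1)
    for d_j, r_j in zip(d, r):
        # Multiply by the conditional probability of surviving past t_j.
        s *= Fraction(r_j - d_j, r_j)
        survival.append(s)
    return survival


# Hypothetical counts: failures and spells at risk at five failure times.
d = [1, 1, 1, 1, 1]
r = [8, 7, 5, 4, 2]
survival = kaplan_meier(d, r)
print([float(s) for s in survival])
```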
When plotting the survival curve obtained from the Kaplan-Meier estimation, confidence bands are usually added to the plot in order to reflect sampling variability (Cameron and Trivedi 2005). The confidence interval of the survival function is derived from an estimate of the variance of \(\hat{S}(t)\), which is given by the Greenwood formula in equation (3.9).
\[\begin{equation} \widehat{\mathrm{V}}[\hat{S}(t)] = \hat{S}(t)^2 \sum_{j|t_j \leq t} \frac{d_j}{r_j(r_j-d_j)} \tag{3.9} \end{equation}\]
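The following Python sketch evaluates equation (3.9) alongside the Kaplan-Meier estimate. The text does not specify how the confidence band is built from the variance, so the symmetric normal-approximation interval used in `normal_band` is an assumption (other constructions, such as log-transformed intervals, exist). The function names and the counts are hypothetical.

```python
import math


def greenwood_variance(d, r):
    """Return (S_hat, variance) at each failure time, per (3.8) and (3.9)."""
    s_hat, running_sum, out = 1.0, 0.0, []
    for d_j, r_j in zip(d, r):
        if d_j >= r_j:
            # The term d_j / (r_j (r_j - d_j)) is undefined in this case.
            raise ValueError("Greenwood term undefined when r_j == d_j")
        s_hat *= (r_j - d_j) / r_j
        running_sum += d_j / (r_j * (r_j - d_j))
        out.append((s_hat, s_hat**2 * running_sum))
    return out


def normal_band(s_hat, variance, z=1.96):
    """Assumed pointwise interval: S_hat +/- z * sd, clipped to [0, 1]."""
    half_width = z * math.sqrt(variance)
    return max(0.0, s_hat - half_width), min(1.0, s_hat + half_width)


# Hypothetical counts: failures and spells at risk at five failure times.
d = [1, 1, 1, 1, 1]
r = [8, 7, 5, 4, 2]
estimates = greenwood_variance(d, r)
for s_hat, var in estimates:
    print(round(s_hat, 3), round(var, 5), normal_band(s_hat, var))
```

As long as no spell has been censored, the Greenwood estimate reduces to the binomial variance \(\hat{S}(t)(1-\hat{S}(t))/N\), which gives a simple way to check the first terms.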
3.4.4 Nelson-Aalen estimator
The cumulative hazard function is estimated by the Nelson-Aalen estimator, which sums the hazard estimates over the discrete failure times up to \(t\):
\[\begin{equation} \hat{\Lambda}(t) = \sum_{j | t_j \leq t} \hat{\lambda}_{j} = \sum_{j | t_j \leq t} \frac{d_j}{r_j} \tag{3.10} \end{equation}\]
Exponentiating the negative of \(\hat{\Lambda}(t)\), one can obtain a second estimate of the survival function (see proof (6.4) in the appendix):
\[\begin{equation} \tilde{S}(t) = \exp \big( -\hat{\Lambda}(t) \big) \tag{3.11} \end{equation}\]
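A minimal Python sketch of equations (3.10) and (3.11) is given below, together with the Kaplan-Meier product for comparison. The function name `nelson_aalen` and the counts are illustrative assumptions.

```python
import math


def nelson_aalen(d, r):
    """Return the cumulative hazard estimates of equation (3.10)."""
    cumulative, total = [], 0.0
    for d_j, r_j in zip(d, r):
        total += d_j / r_j
        cumulative.append(total)
    return cumulative


# Hypothetical counts: failures and spells at risk at five failure times.
d = [1, 1, 1, 1, 1]
r = [8, 7, 5, 4, 2]

cum_hazard = nelson_aalen(d, r)
# Second survival estimate, equation (3.11).
s_tilde = [math.exp(-h) for h in cum_hazard]

# Kaplan-Meier estimate, equation (3.8), for comparison.
km, s_hat = [], 1.0
for d_j, r_j in zip(d, r):
    s_hat *= (r_j - d_j) / r_j
    km.append(s_hat)

for h, a, b in zip(cum_hazard, s_tilde, km):
    print(round(h, 4), round(a, 4), round(b, 4))
```

Because \(\exp(-x) \geq 1 - x\), the estimate \(\tilde{S}(t)\) is never below the Kaplan-Meier estimate; the two are close when the hazard estimates are small and drift apart when few spells remain at risk.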