On the special nature of survival data

Jacobo de Uña-Álvarez
CINBIO, Universidade de Vigo, SiDOR research group
ORCID: 0000-0002-4686-8417. Email: jacobo@uvigo.gal

Adrián Lago
CINBIO, Universidade de Vigo, SiDOR research group
ORCID: 0000-0002-6573-043X. Email: adrian.lago@uvigo.gal


Abstract

The estimation of a survival function is, more often than not, a non-trivial issue due to the special nature of the sampling information. Survival data typically suffer from random censoring and/or truncation, as recognized in most textbooks on the topic. In this work we revisit these issues and discuss the difficulties that appear when handling censored and truncated survival data. Special attention is paid to situations in which the nonparametric maximum-likelihood estimator of the survival function may degenerate, be non-unique or even fail to exist. Illustrative examples, simulation studies and real data applications are included. R code is provided.

Keywords: Censoring, Nonparametric maximum-likelihood, Sampling bias, Survival Analysis, Truncation.

MSC Subject classifications: 62G05, 62N02, 62N05.

Introduction

Statistical methods for survival data and lifetimes date from the late fifties. They were rapidly developed and disseminated in the subsequent years due to their importance in clinical research and engineering (Kalbfleisch and Prentice 1980; Fleming and Harrington 1981; Lawless 1982; Cox and Oakes 1984). This resulted in two well-established, and often overlapping, fields of statistical knowledge: (i) Survival Analysis, and (ii) Reliability. In these two areas, the target variable \(X\) is the time to a relevant event, such as recurrence or death in clinical studies, or malfunction or failure in engineering. A function of much interest in these settings is the survival (or reliability) function \(S(x)=P(X>x), x\geq 0\), which reports, for each \(x\)-value, the survival probability beyond time \(x\). Obviously, \(S(\cdot)\) characterizes the probability distribution of \(X\).

Estimation of \(S(x)\) is simple under random sampling. If \(X_1,\ldots,X_n\) are independent and identically distributed (iid) observations of \(X\), a consistent estimator of \(S(x)\) is given by



\[S_n(x)=\frac{1}{n}\sum_{i=1}^n I(X_i>x),
\label{eq:iid}\qquad(1)\]

where \(I(\cdot)\) is the indicator function. Indeed, ([eq:iid]) is the nonparametric maximum-likelihood estimator (NPMLE) of \(S(x)\) in such a case. Unfortunately, in most applications with survival data the iid assumption does not hold, and estimators other than ([eq:iid]) are needed. To be specific, in Survival Analysis and Reliability censoring and truncation issues typically affect the sampling information, and this confers a special nature on survival data (Klein and Moeschberger 2003). The same applies to other fields like, for instance, Astronomy, Economics and Epidemiology, where a random variable subject to censoring and/or truncation is often the focus of research.

In addition to \(S(\cdot)\), an important function when handling survival data is the hazard (or failure rate), which represents the instantaneous risk of the event occurring. When \(X\) has a density \(f(\cdot)\), the hazard function is given by \(h(x)=f(x)/S(x)\), \(x\geq 0\). In the discrete setting one rather has \(h(x)=P(X=x|X\geq x)\), and the survival function is linked to the hazard function through the equation



\[S(x)=\prod_{x_k\leq x}[1-h(x_k)]
\label{eq:haz}\qquad(2)\]

where \(x_k\), \(1\leq k\leq K\), are the mass points of \(X\). This relationship indicates that, in order to survive beyond time \(x\), one needs to overcome the risk attached to the time points \(x_k\) smaller than or equal to \(x\). From equation ([eq:haz]) it is not surprising that ([eq:iid]) can be re-written in the product form



\[S_n(x)=\prod_{x_k\leq x}[1-h_n(x_k)]
\label{eq:hazn}\qquad(3)\]

where the \(x_k\)’s denote now the distinct values among the \(X_i\)’s, and where



\[h_n(x_k)=\frac{\#\{i:X_i=x_k\}}{\#\{i:X_i\geq x_k\}}\equiv \frac{d_k}{n_k}.
\label{eq:ratio}\qquad(4)\]

The numerator \(d_k\) and the denominator \(n_k\) in ([eq:ratio]) correspond to the number of events at time \(x_k\) and the number of individuals at risk at time \(x_k\), respectively. Note that, for discrete populations, the strong law of large numbers (SLLN) guarantees \(h_n(x_k) \rightarrow h(x_k)\) as \(n \rightarrow \infty\) almost surely for each fixed \(x_k\). Again, this is true for random sampling; with censored or truncated data, the estimation of the hazard function requires a re-definition of the risk set \(\mathcal{R}_k=\{i:X_i \geq x_k\}\).
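For an uncensored sample, the equivalence between ([eq:iid]) and the product form ([eq:hazn])-([eq:ratio]) can be verified numerically with a few lines of base R; the following didactic sketch (ours, on an arbitrary small sample) does just that:

    # Sketch: for uncensored data, the product form (3)-(4)
    # coincides with the empirical survival function (1)
    X <- c(2.1, 0.7, 3.3, 1.5, 0.7)                    # any iid sample (with a tie)
    xk <- sort(unique(X))                              # distinct values x_k
    dk <- sapply(xk, function(x) sum(X == x))          # events d_k
    nk <- sapply(xk, function(x) sum(X >= x))          # at risk n_k
    Sn <- cumprod(1 - dk / nk)                         # product form (3)
    all.equal(Sn, sapply(xk, function(x) mean(X > x))) # TRUE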

In this work we revisit the issues of censoring and truncation, and we discuss how a NPMLE for the survival function can be derived when the data are censored and/or truncated. The existing relationship between the NPMLE and the construction of convenient risk sets is highlighted. Inverse-probability weighted estimation (IPWE) is also reviewed in connection with the maximum-likelihood approach. We pay special attention to situations in which the NPMLE may degenerate, be non-unique or even fail to exist. Illustrative examples are given.

The rest of the paper is organized as follows. Section 2 introduces the issue of right-censoring, which is the most frequent situation in follow-up studies and life testing scenarios. Left-censoring and interval-censoring are more briefly considered at the end of Section 2. Section 3 starts by describing left-truncation just as it appears in cross-sectional studies and in studies with delayed entries; right-truncation is also covered. The less known issue of double truncation, which emerges for instance with interval sampling, is detailed in Section 3 too. Throughout Sections 2 and 3 some toy examples are provided for a better understanding of the problems underlying estimation from censored or truncated data. In Section 4 simulation studies are performed, while Section 5 is devoted to illustrations with real survival data. Finally, the main conclusions of the paper are collected in Section 6.

Censoring

Right-censoring

Right-censoring occurs when the event of interest is not observed due to the end of the study, losses to follow-up, withdrawals from the study and so on. This implies that some of the recorded data \(X_i\) are strictly smaller than the target value. In order to account for this, each datum \(X_i\) goes with the so-called uncensoring indicator \(\delta_i\), so only cases with \(\delta_i=1\) report true event times. When \(\delta_i=0\), it is just known that \(X\) is larger than the recorded datum \(X_i\). Of course, the naive application of ([eq:iid]) to a right-censored sample results in an underestimation of the survival function. On the other hand, removing from the sample the cases with \(\delta_i=0\) does not solve the problem, since the uncensored \(X_i\) are stochastically smaller than \(X\) due to the induced selection bias; mathematically, \(P(X>x|\delta=1)\leq S(x)\), \(x\geq 0\).

A consistent estimator of \(S(x)\) under right-censoring can be obtained from ([eq:hazn]). For this, the numerator \(d_k\) in ([eq:ratio]) must be replaced by \(d_k=\#\{i:X_i=x_k,\delta_i=1\}\), the number of real events at time \(x_k\). This leads to the time-honoured Kaplan-Meier estimator (Kaplan and Meier 1958). For consistency purposes, it is implicitly assumed that the individuals that undergo neither the event of interest nor censoring before time \(x_k\) follow the target hazard \(h(x_k)\), \(1\leq k\leq K\). This is usually driven through a model with two independent latent variables: the variable of interest \(X\), and the potential right-censoring time \(C\); then, the recorded \(X_i\)’s are realizations of \(\min(X,C)\), and the \(\delta_i\)’s come from the indicator \(I(X\leq C)\). In such a model, due to the SLLN,

\[\frac{\#\{i:X_i=x_k,\delta_i=1\}}{\#\{i:X_i\geq x_k\}}\rightarrow \frac{P(X=x_k,x_k\leq C)}{P(X\geq x_k,C\geq x_k)}=h(x_k)\qquad(5)\]

with probability 1 as \(n\rightarrow \infty\), where the last equality follows from the independence between \(X\) and \(C\). This convergence helps to understand why the Kaplan-Meier technique works.

When the distribution of \(C\) is unrelated to \(S(\cdot)\), the likelihood of the \((X_i,\delta_i)\)’s is proportional to

\[\mathcal{L}_n^{(rc)}=\prod_{i=1}^n dF(X_i)^{\delta_i} S(X_i)^{1-\delta_i}\qquad(6)\]

where \(dF(\cdot)\) stands for the measure attached to the cumulative distribution function (cdf) of \(X\), \(F(x)=P(X\leq x)=1-S(x)\), \(x\geq 0\). Interestingly, this likelihood is valid for a variety of censoring mechanisms that appear in clinical studies and engineering. Clearly, maximization of \(\mathcal{L}_n^{(rc)}\) over the class of all the distribution functions can be limited to the ones concentrated on the \(X_i\)’s; this leads back to the Kaplan-Meier estimator. Large sample theory for this estimator was first developed by (Breslow and Crowley 1974).

As a toy example, consider a right-censored sample of size \(n=7\) with survival times \(X_i\) and uncensoring indicators \(\delta_i\) given respectively by the objects X and d below. In R, the Kaplan-Meier survival function is easily obtained from the function survfit() within the package survival (Therneau 2022) from the statistical software R (R Core Team 2022) as follows:

    > X <- c(0.75, 1.05, 1.25, 1.5, 2.25, 2.4, 2.5)
    > d <- c(1, 1, 0, 1, 1, 1, 0)
    > library(survival)
    > survfit(Surv(X, d) ~ 1) -> res
    > summary(res)
    Call: survfit(formula = Surv(X, d) ~ 1)
    
    time n.risk n.event survival std.err lower 95% CI upper 95% CI
    0.75      7       1    0.857   0.132       0.6334            1
    1.05      6       1    0.714   0.171       0.4471            1
    1.50      4       1    0.536   0.201       0.2570            1
    2.25      3       1    0.357   0.198       0.1205            1
    2.40      2       1    0.179   0.160       0.0307            1

In this numerical output, the labels n.event and n.risk correspond to the \(d_k\)’s and the \(n_k\)’s respectively, so the hazard at each time point, \(h(x_k)\), is estimated by the ratio of these two columns. On the other hand, the evaluation of the Kaplan-Meier estimator along the event times, \(S_n(x_k)\), is displayed under label survival.

One embarrassing feature of the Kaplan-Meier estimator is that it does not reach zero whenever the maximum \(X_i\)-value is censored. But this is the natural consequence of the missing information beyond such a maximum record. For the current toy example, the estimated survival probability beyond time \(X_7=2.5\) is \(0.179\), and there is no way, in the absence of external information, to know how this survival decreases afterwards. This is somehow connected with the fact that the consistency of the Kaplan-Meier estimator \(S_n(x)\) only holds for \(x\)-values within the support of the censoring variable. In extreme situations of heavy censoring, summary measures such as the median may not even exist; this occurs when \(S_n(x)\) stays above \(1/2\) for all \(x\)’s. In particular, the Kaplan-Meier survival is constantly equal to one when all the \(\delta_i\)’s are zero.

The Kaplan-Meier estimator can be introduced as an IPWE too. Under the independent censoring model, it holds

\[F(x)=E\left[\frac{I(X\leq x)\delta}{1-G(X-)}\right]\qquad(7)\]

for \(x\) in the support of \(C\), where \(1-G(x-)=P(C\geq x)=P(\delta=1|X=x)\). The cdf \(G\) of the right-censoring time can be estimated from the Kaplan-Meier technique itself, by interchanging the \(0\)’s and \(1\)’s in the \(\delta\)-labels. Let \(G_n\) be such an estimator; then, the sample average \(n^{-1}\sum_{i=1}^n I(X_i\leq x)\delta_i(1-G_n(X_i-))^{-1}\) is just the Kaplan-Meier estimator of \(F(x)\). In words, the Kaplan-Meier estimator can be computed from the uncensored data, by inversely weighting the observations according to the selection probability \(P(\delta=1|X=x)\). See (Satten and Datta 2001) for a deeper insight.
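This identity can be checked numerically on the toy data of the previous example; the following lines are a didactic sketch of ours (tie-handling conventions between event and censoring times are ignored), with survfit() used to estimate both \(F\) and \(G\):

    X <- c(0.75, 1.05, 1.25, 1.5, 2.25, 2.4, 2.5)
    d <- c(1, 1, 0, 1, 1, 1, 0)
    library(survival)
    fitC <- survfit(Surv(X, 1 - d) ~ 1)   # Kaplan-Meier for the censoring cdf G
    # P(C >= t), i.e. the left limit 1 - G_n(t-)
    PCge <- function(t) sapply(t, function(s)
      prod(1 - fitC$n.event[fitC$time < s] / fitC$n.risk[fitC$time < s]))
    # IPW average at x = 1.5 versus the Kaplan-Meier cdf: both approx. 0.464
    sum((X <= 1.5) * d / PCge(X)) / length(X)
    1 - summary(survfit(Surv(X, d) ~ 1), times = 1.5)$surv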

Left-censoring

The target \(X\) is left-censored by \(C\) when it is only known that \(X<C\). Estimation of the survival function from left-censored data can be done by using the Kaplan-Meier method as described in the previous section, after noting that \(-X\) is right-censored by \(-C\). Compared to right-censoring, censoring from the left is less frequently encountered in data recruitment. Left-censoring appears in sampling designs that involve a retrospective assessment of event times (Klein and Moeschberger 2003). For instance, questionnaires on smoking habits provide a left-censored value for the age at onset \(X\) when the respondent is unable to recall his/her age at first experience; in this case, \(C\) stands for the current age of the individual. A similar situation is found in the investigation of the age at which a child learns a certain task, whenever he/she is already able to perform it at entry into the study. However, examples with pure left-censoring are hardly found in practice.

Studies involving left-censored data in which the individuals are also subject to censoring from the right are more common, leading to doubly-censored data (Turnbull 1974). Under double censoring the NPMLE has no closed form, and must be computed from an iterative algorithm. This iterative procedure is based on the notion of self-consistency, which is applicable to purely right-censored samples too (Efron 1967). A detailed description of the algorithm can be found, for instance, in (Klein and Moeschberger 2003).

Interval censoring

Interval censored data often appear in longitudinal studies due to the non-continuous monitoring of the individuals. With interval censoring, the value of \(X\) is known to fall within a certain interval \((L,R]\), but \(X\) itself is unknown. When \(X\) and \((L,R)\) are independent, the conditional likelihood of the indicators \(I(L_i<X\leq R_i)\) given the \((L_i,R_i)\)’s can be used to obtain the NPMLE. This conditional likelihood is given by

\[\mathcal{L}_{n,c}^{(ic)}=\prod_{i=1}^n [S(L_i)-S(R_i)].\qquad(8)\]

As with double censoring, the maximizer of \(\mathcal{L}_{n,c}^{(ic)}\) must be obtained numerically. (Turnbull 1976) introduced a general expectation-maximization algorithm to compute the NPMLE for arbitrarily grouped, censored and truncated data on the basis of a self-consistency equation. For interval censoring he showed that the NPMLE is concentrated on the so-called Turnbull intervals \((q_j,p_j]\), \(1\leq j\leq m\); here, the \(q_j\)’s are obtained from the \(L_i\)’s, and the \(p_j\)’s are obtained from the \(R_i\)’s, in such a way that there is no other \(L_i\)- or \(R_i\)-value within \((q_j,p_j]\). For the identifiability of \(S(\cdot)\), it is often assumed that the support of \(L\) or \(R\) covers the positive half-line. Due to the scarce information, the NPMLE of the survival function of \(X\) remains constant outside Turnbull intervals, and it is unspecified between \(q_j\) and \(p_j\), \(1\leq j\leq m\); see for instance (Klein and Moeschberger 2003), (Sun 2006) or (Gómez et al. 2009) for details and access to the related literature.

In order to illustrate the nature and computation of the NPMLE with interval-censored data, we consider an example. Assume that the seven \(X_i\)-values in the example of Section 2.1 are censored by the following intervals \((L_i,R_i]\), which are the only available information: \((0.5,1]\), \((0.8,1.3]\), \((1,1.5]\), \((1,2]\), \((2.1,2.4]\), \((2.15,2.65]\), and \((2.3,2.7]\). Then, Turnbull intervals are \((0.8,1]\), \((1,1.3]\), and \((2.3,2.4]\), and the mass attached to these intervals by the NPMLE is \(0.1905\), \(0.3810\), and \(0.4286\), respectively. These results can be easily obtained from the functions MLEintvl() and EM() within the R package Icens (Gentleman and Vandal 2022). Possible code lines are as follows:

    > data <- data.frame(left = c(0.5, 0.8, 1, 1, 2.1, 2.15, 2.3), 
    +      right = c(1, 1.3, 1.5, 2, 2.4, 2.65, 2.7))
    > library(Icens)
    > TBint <- with(data, MLEintvl(cbind(left, right)))
    > EMest <- with(data, EM(cbind(left, right)))
    > cbind(qj = TBint[,1], pj = TBint[,2], prob.mass = round(EMest$pf, 4))
          qj  pj prob.mass
    [1,] 0.8 1.0    0.1905
    [2,] 1.0 1.3    0.3810
    [3,] 2.3 2.4    0.4286

Truncation

Random truncation refers to a situation in which the observation of the target variable requires a condition on the \(X\)-value itself (Turnbull 1976). Let \(\mathcal{B}\) be a random set, and assume that \(X\) is observed only when \(X\in \mathcal{B}\). In this case, the \(X_i\)’s in the sample follow the conditional distribution \(F^*(x)=P(X\leq x|X\in \mathcal{B})\). When \(X\) and \(\mathcal{B}\) are independent, it holds \(dF^*(\cdot) \propto G(\cdot)dF(\cdot)\), where \(G(x)=P(x\in \mathcal{B})\) stands for the probability of recruiting an individual for which \(X=x\) has occurred, \(x\geq 0\). This has two important consequences: (a) truncation induces a selection bias whenever \(G(\cdot)\) is not flat; and (b) consistent estimation of \(S(\cdot)=1-F(\cdot)\) is possible only when \(G(\cdot)\) is positive on the support of \(X\). Fact (a) means that, in general, a substitute for the ordinary empirical cdf is needed in order to consistently estimate \(F\). Fact (b) is related to the identifiability of the target survival function.

Left-truncation and right-truncation

Left-truncation corresponds to the special case \(\mathcal{B}=[U,\infty)\) for some random variable \(U\), which is called left-truncation time, and which is assumed to be independent of \(X\). Then, \(G(\cdot)\) reduces to the cdf of \(U\). Since \(G(\cdot)\) is monotone non-decreasing, large values of \(X\) are oversampled under left-truncation. On the other hand, identifiability of \(F\) is ensured whenever the lower limit of the support of \(U\) is smaller than that of the support of \(X\).

Left-truncated data appear, for instance, with cross-sectional sampling, where the sampled individuals are those being in progress at some specific (cross-section) date, \(d_{cs}\) say. With cross-sectional data, \(U\) is defined as the time from onset to \(d_{cs}\). Delayed entry into the study is another example of left-truncation; in this case, \(U\) stands for the entry time.

The NPMLE of \(S(x)\) under left-truncation can be easily computed from the available data, which are iid copies \((X_i,U_i)\), \(1\leq i\leq n\), following the conditional distribution of \((X,U)\) given \(U\leq X\). For this, one just evaluates the estimator in ([eq:hazn]) after a proper re-definition of the risk sets. Specifically, the denominator \(n_k\) in ([eq:ratio]) is replaced by \(n_k=\#\{i:X_i\geq x_k \geq U_i\}\), \(1\leq k\leq K\) (Lynden-Bell 1971; Woodroofe 1985). Here, the SLLN guarantees the almost sure convergence (as \(n\rightarrow \infty\))

\[\frac{\#\{i:X_i=x_k\}}{\#\{i:X_i\geq x_k\geq U_i\}}\rightarrow \frac{P(X=x_k|U\leq X)}{P(X\geq x_k\geq U|U\leq X)}=\frac{P(X=x_k,U\leq x_k)}{P(X\geq x_k\geq U)}=h(x_k),\qquad(9)\]

where the independence between \(U\) and \(X\) is needed for the last equality. Unlike for right-censoring, with left-truncated data the number of individuals at risk \(n_k\) is not a monotone sequence. Actually, under left-truncation \(h_n(x_k)=1\) may occur before the maximum \(X_i\)-value; in this case, \(S_n(x)\) collapses to zero at time \(x=x_k\) and beyond, leading to a degenerate estimate. Such \(x_k\)-points have been called holes in the related literature (Strzalkowska-Kominiak and Stute 2010).

The aforementioned NPMLE for left-truncated data indeed maximizes the conditional likelihood of the \(X_i\)’s given the \(U_i\)’s:

\[\mathcal{L}_{n,c}^{(lt)}=\prod_{i=1}^n \frac{dF(X_i)}{1-F(U_i-)},\qquad(10)\]

where \(F(x-)=P(X<x)\) denotes the left-continuous version of \(F(x)\). The fact that this conditional likelihood contains all the relevant information to estimate \(F(\cdot)\) is not obvious. In order to explain this, consider the full likelihood \(\mathcal{L}_n^{(lt)}=\alpha^{-n}\prod_{i=1}^n dF(X_i)dG(U_i)\), where \(\alpha=P(U\leq X)\) is the probability of no truncation. Factorize \(\mathcal{L}_n^{(lt)}=\mathcal{L}_{n,c}^{(lt)} \times \mathcal{L}_{n,m}^{(lt)}\), where \(\mathcal{L}_{n,m}^{(lt)}=\alpha^{-n}\prod_{i=1}^n (1-F(U_i-))dG(U_i)\) is the marginal likelihood of the \(U_i\)’s. Then, it is easy to see that, for each given \(F\), \(\mathcal{L}_{n,m}^{(lt)}\) is maximized by

\[G_n(u;F)=\frac{\sum_{i=1}^n (1-F(U_i-))^{-1}I(U_i\leq u)}{\sum_{i=1}^n (1-F(U_i-))^{-1}},\qquad(11)\]

and that the value of \(\mathcal{L}_{n,m}^{(lt)}\) at \(G_n(\cdot;F)\) is independent of \(F\). This proves that the full likelihood is maximized at \((F_n, G_n(\cdot;F_n))\), where \(F_n=1-S_n\) is the maximizer of \(\mathcal{L}_{n,c}^{(lt)}\).

For illustration purposes, we consider the toy example in Section 2.1, but now adapted to the left-truncated setting. The \(X_i\)-values are the same as in Section 2.1, but no censoring issue is present (that is, \(\delta_i=1\) for \(1\leq i\leq n\)). Besides, the left-truncation times \(U_i\) are those in object U below; note that \(U_i\leq X_i\) holds for \(1\leq i\leq n\). Once more, we use the R function survfit() to proceed:

    > X <- c(0.75, 1.05, 1.25, 1.5, 2.25, 2.4, 2.5)
    > U <- c(0.4, 0.3, 1.1, 1.2, 1.3, 1.1, 2.3)
    > d <- c(1, 1, 1, 1, 1, 1, 1)
    > library(survival)
    > survfit(Surv(U, X, d) ~ 1) -> res
    > summary(res)
    Call: survfit(formula = Surv(U, X, d) ~ 1)
    
    time n.risk n.event censored survival std.err lower 95% CI upper 95% CI
    0.75      2       1        0      0.5   0.354        0.125            1
    1.05      1       1        0      0.0     NaN           NA           NA
    1.25      3       1        0      0.0     NaN           NA           NA
    1.50      3       1        0      0.0     NaN           NA           NA
    2.25      2       1        0      0.0     NaN           NA           NA
    2.40      2       1        0      0.0     NaN           NA           NA
    2.50      1       1        0      0.0     NaN           NA           NA

The output above is practically useless. The Lynden-Bell survival function degenerates just after the second event time because \(X_2\) is a hole, as discussed above: \(\sum_{i=1}^n I(U_i\leq X_2\leq X_i)=1\). The fact that \(S_n(\cdot)\) attaches no mass to 5 out of the 7 event times is disappointing, and illustrates the potential difficulties when handling truncated data.

Two different approaches have been suggested in order to avoid degenerate estimation of the survival function in practice. The first one consists in re-defining the target as a conditional survival probability, \(S(x|x_0)=P(X>x|X> x_0)\), \(x\geq x_0\), where \(x_0\) is located to the right of the maximum hole-value in the sample. Note that, according to ([eq:haz]), \(S(x|x_0)=\prod_{x_0<x_k\leq x}[1-h(x_k)]\), so all the event times smaller than or equal to \(x_0\) become irrelevant. For our toy example, by using \(x_0=X_2\) one gets the estimated survival probabilities \(S_n(1.25|1.05)=2/3\), \(S_n(1.50|1.05)=4/9\), \(S_n(2.25|1.05)=2/9\), \(S_n(2.40|1.05)=1/9\) and \(S_n(2.50|1.05)=0\). The second approach to avoid degeneration relies on a modification of the \(n_k\)’s, so the equality \(h_n(x_k)=1\) cannot occur before reaching the maximum \(X_i\). A standard choice here is \(\tilde n_k=(n+1)n_k/n\); this introduces only minor changes in the estimated hazards for moderate to large sample sizes \(n\), which is the typical situation in practice.
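The conditional estimates above can be reproduced with a few lines of base R; this is a didactic sketch (ours) of the first approach, with \(x_0=X_2=1.05\):

    X <- c(0.75, 1.05, 1.25, 1.5, 2.25, 2.4, 2.5)
    U <- c(0.4, 0.3, 1.1, 1.2, 1.3, 1.1, 2.3)
    x0 <- 1.05
    xk <- sort(X[X > x0])                               # event times beyond x0
    nk <- sapply(xk, function(x) sum(X >= x & U <= x))  # Lynden-Bell risk sets
    cumprod(1 - 1 / nk)       # 0.6667 0.4444 0.2222 0.1111 0.0000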

Interestingly, in all of the above the roles of \(X\) and \(U\) can be interchanged, so one comes up with an estimator of the cdf of the left-truncation time \(U\), \(G_n\) say. This can be done in two ways. An obvious route is to re-write the right-truncation condition on \(U\) as a left-truncation condition on \(-U\); that is, the events \(\{U\leq X\}\) and \(\{-X\leq -U\}\) coincide. Then, one computes an estimator of the survival function of \(-U\) by considering the left-truncation issue and, from this, an estimate for \(G\) is readily obtained. The second way to estimate \(G\) is by using arguments similar to those above for the construction of the Lynden-Bell survival function, taking into account the fact that now truncation occurs from the right. Indeed, the literature on estimation from right-truncated data is very well developed (Klein and Moeschberger 2003). The issue of right-truncation appears, for instance, whenever the available \(X_i\)’s correspond to the individuals undergoing the event of interest before a given date, \(d_1\) say. This has traditionally been the situation with registries for AIDS epidemiology, where \(X\) denotes the incubation time (defined as the time from HIV infection to AIDS). Let \(V\) be the time from onset to \(d_1\) in such a setting. Right-truncation occurs because only cases satisfying \(X\leq V\) are recruited. The variable \(V\) is called the right-truncation time; the NPMLE of \(S(x)=P(X>x)\) in the right-truncated setting becomes

\[S_n(x)=1-\prod_{x_k>x}\left[1-\frac{\#\{i:X_i=x_k\}}{\#\{i:X_i\leq x_k\leq V_i\}}\right].\qquad(12)\]

This estimator can be derived as the maximizer of the conditional likelihood of the \(X_i\)’s given the \(V_i\)’s, in the same manner as for left-truncated data.

The duality between left-truncation and right-truncation is useful to link NPMLE and IPWE too. For instance, the NPMLE \(F_n(\cdot)\) for \(F(\cdot)=1-S(\cdot)\) under left-truncation allocates a mass proportional to \(G_n(X_i)^{-1}\) at each \(X_i\), where \(G_n(\cdot)\) stands for the NPMLE of the distribution function of the left-truncation time. This means that, for the construction of \(F_n\), each \(X_i\)-datum is inversely weighted according to its (estimated) sampling probability. Conversely, as indicated above, the \(U_i\)’s get probability masses proportional to \((1-F_n(U_i-))^{-1}\) under \(G_n(\cdot)\). These relationships will appear in the discussion of the double truncation model too; see Section 3.3.

Left-truncation with right-censoring

The ideas for the estimation of a left-truncated survival function discussed in the previous section can be easily extended to cope with an extra risk of right-censoring. This is important, since left-truncated and right-censored data are often encountered in cross-sectional studies and in studies with delayed entries.

Assume that \((U,C)\) is a random vector independent of \(X\) such that the triplet \((\min(X,C),U,I(X\leq C))\) is observed only when \(\min(X,C)\geq U\). Here, \(C\) is the right-censoring time, \(U\) is the left-truncation time, and \(I(X\leq C)\) is the uncensoring indicator. The sampling information is given by \((X_i,U_i,\delta_i)\) with \(X_i\geq U_i\), \(1\leq i\leq n\).

Define the number of individuals at risk at each \(X\)-value \(x_k\) as \(n_k=\#\{i:X_i\geq x_k\geq U_i\}\). Also, consider the number of events at time \(x_k\), \(d_k=\#\{i:X_i=x_k,\delta_i=1\}\). Then, the SLLN and the independence between \(X\) and \((U,C)\) jointly guarantee

\[\frac{d_k}{n_k}\rightarrow \frac{P(X=x_k,C\geq x_k|U\leq X)}{P(X\geq x_k, C\geq x_k\geq U|U\leq X)}=\frac{P(X=x_k,C\geq x_k\geq U)}{P(X\geq x_k,C\geq x_k\geq U)}=h(x_k).\qquad(13)\]

Obviously, the survival function computed from the estimated hazards \(h_n(x_k)=d_k/n_k\) and equation ([eq:hazn]) inherits the potential issues discussed above for the Kaplan-Meier and Lynden-Bell estimators. This includes the possibility of having some probability mass at infinity (the estimated survival function does not reach zero) and the risk of holes. On the other hand, the construction of the likelihood function becomes more involved when both left-truncation and right-censoring are present. Still, the previous estimator can be introduced as a NPMLE, which is connected to IPWE principles as well (Wang 1991; Heuchenne, de Uña-Álvarez, and Laurent 2020).

Doubly truncated data

Left-truncation and right-truncation are the two possible situations with one-sided truncation. A generalization of this is the double truncation model, in which a triplet \((X,U,V)\) is observed only when \(U\leq X\leq V\). Here, \(X\) denotes again the target variable, while the couple \((U,V)\) is called truncation vector. As for one-sided truncation, it is usually assumed that \(X\) and \((U,V)\) are independent. Note that double truncation falls within the general setting of random truncation, by taking \(\mathcal{B}=[U,V]\). Under double truncation, the sampling probability for a particular value \(X=x\) is given by \(G(x)=P(U\leq x\leq V)\); conversely, the sampling probability for \((U,V)=(u,v)\) is \(F(v)-F(u-)\), where \(F\) is the cdf of \(X\).

Doubly truncated data appear in particular with interval sampling, when the sampling information is restricted to the individuals with event between two calendar times \(d_0\) and \(d_1\). In such a case, \(V\) stands for the time from onset to \(d_1\), while \(U=V-(d_1-d_0)\). Applications with interval sampling include childhood cancer epidemiology, autopsy-confirmed studies, reliability engineering, age at insolvency for firms, or marriage lengths, among many others. See (de Uña-Álvarez, Moreira, and Crujeiras 2021) for particular examples and access to the related literature.

In general, how to re-define the risk set \(\mathcal{R}_k\) under double truncation in order to obtain a consistent estimator of the hazard \(h(x_k)\) is an unsolved problem. However, a NPMLE approach to directly estimate the survival function of \(X\) is possible. Given iid observations \((X_i,U_i,V_i)\), \(1\leq i\leq n\), with the conditional distribution of \((X,U,V)\) given \(U\leq X\leq V\), the full likelihood is given by



\[\mathcal{L}_n^{(dt)}=\alpha^{-n}\prod_{i=1}^n dF(X_i)dK(U_i,V_i),
\label{eq:dtlik}\qquad(14)\]

where \(\alpha=P(U\leq X\leq V)\) and where \(K\) denotes the joint distribution function of \((U,V)\). The maximizer \((F_n,K_n)\) of ([eq:dtlik]) is the solution of the pair of equations

  • (E1) \(F_n(x)=\sum_{i=1}^n G_n(X_i)^{-1}I(X_i\leq x)/\sum_{i=1}^n G_n(X_i)^{-1}\)
  • (E2) \(K_n(u,v)=\sum_{i=1}^n (F_n(V_i)-F_n(U_i-))^{-1}I(U_i\leq u,V_i\leq v)/\sum_{i=1}^n (F_n(V_i)-F_n(U_i-))^{-1}\)

where \(G_n(x)=\int I(u\leq x\leq v)dK_n(u,v)\) (Shen 2010); as a by-product, this shows that NPMLE and IPWE are linked also under double truncation. There is no explicit solution to this optimization problem, but an iterative algorithm can be easily designed from (E1)-(E2). One may start, for example, with \(G_n(X_i)=1/n\), \(1\leq i\leq n\), in (E1), so \(F_n\) becomes the ordinary empirical cdf of the \(X_i\)’s; and then, by using (E2), \(K_n\) (and hence \(G_n\)) are updated. The procedure ends when the solution converges. Implementation of this iterative algorithm in R is provided by the function shen() within the package DTDA (Moreira, de Uña-Álvarez, and Crujeiras 2022).
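For illustration only, a minimal R implementation of this fixed-point iteration might look as follows; shen() remains the recommended tool, and the function name npmle_dt is ours:

    # Sketch of the (E1)-(E2) iteration for doubly truncated data
    npmle_dt <- function(X, U, V, tol = 1e-6, maxit = 5000) {
      n <- length(X)
      f <- rep(1 / n, n)                    # initial masses of F_n at the X_i's
      for (it in seq_len(maxit)) {
        # (E2): mass at (U_j, V_j) proportional to 1 / (F_n(V_j) - F_n(U_j-))
        Fuv <- sapply(seq_len(n), function(j) sum(f[X >= U[j] & X <= V[j]]))
        k <- (1 / Fuv) / sum(1 / Fuv)
        # sampling probabilities G_n(X_i) under the current K_n
        G <- sapply(seq_len(n), function(i) sum(k[U <= X[i] & V >= X[i]]))
        # (E1): mass at X_i proportional to 1 / G_n(X_i)
        fnew <- (1 / G) / sum(1 / G)
        done <- max(abs(fnew - f)) < tol
        f <- fnew
        if (done) break
      }
      list(prob.mass = f, sampl.prob = G)
    }

Applied to the toy data of (Efron and Petrosian 1999) considered below, this sketch should reproduce the masses and sampling probabilities reported by shen().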

As for one-sided truncation, the NPMLE \(F_n\) with doubly truncated data can be obtained by maximizing a conditional likelihood, namely

\[\mathcal{L}_{n,c}^{(dt)}=\prod_{i=1}^n \frac{dF(X_i)}{F(V_i)-F(U_i-)}.\qquad(15)\]

This represents the conditional likelihood of the \(X_i\)’s given the observed values for the truncation couple \((U,V)\). In their seminal paper, (Efron and Petrosian 1999) proposed two different algorithms to maximize \(\mathcal{L}_{n,c}^{(dt)}\). These have been implemented in the functions efron.petrosian() and lynden() of the DTDA package; however, unlike the function shen(), they do not report the estimated sampling probabilities \(G_n(X_i)\), \(1\leq i\leq n\), which are of independent interest.

In order to illustrate the computation of the NPMLE under double truncation, consider the toy example in (Efron and Petrosian 1999); the data have been included in the objects X, U and V below.

    > X <- c(0.75, 1.05, 1.25, 1.5, 2.25, 2.4, 2.5)
    > U <- c(0.4, 0.3, 0.8, 0, 1.3, 1.1, 2.3)
    > V <- c(2, 1.4, 1.8, 2.3, 2.6, 3, 3.4)
    > library(DTDA)
    > shen(X, U, V, boot = F) -> res
    n.iterations 37 
    S0 9.708669e-07 
    events 7
    > cbind(time = res$time, sampl.prob = res$biasf, prob.mass = res$density, 
         + survival = res$survival)
         time sampl.prob prob.mass survival
    [1,] 0.75    0.44117   0.13714  0.86286
    [2,] 1.05    0.66830   0.09053  0.77234
    [3,] 1.25    0.74663   0.08103  0.69131
    [4,] 1.50    0.63816   0.09480  0.59650
    [5,] 2.25    0.26109   0.23172  0.36479
    [6,] 2.40    0.33170   0.18239  0.18239
    [7,] 2.50    0.33170   0.18239  0.00000

In the output above it is seen that the probability masses (third column) do not coincide with \(1/n=0.14286\). The \(X_i\)-values get a mass that is inversely proportional to the sampling probabilities \(G_n(X_i)\) (second column). Note that the function \(G_n(x)\) is non-monotone; this is a consequence of the fact that truncation occurs at both sides of the distribution. The survival probabilities \(S_n(X_i)=1-F_n(X_i)\) are reported in the last column of the output.

In some applications with doubly truncated data one will find that the NPMLE \(F_n\) does not exist or is not unique. The situation has been theoretically investigated by (Xiao and Hudgens 2019), who proved that the NPMLE exists and is unique if and only if the data constitute a strongly connected graph. We consider a new example to introduce and discuss this concept:

    > X <- c(0.75, 1.05, 1.25, 1.5, 2.25, 2.4, 2.5)
    > U <- c(0.4, 0.3, 1.1, 1.2, 1.3, 1.1, 2.3)
    > V <- c(2, 1.4, 1.8, 2.3, 2.6, 3, 3.4)
    > library(DTDA)
    > shen(X, U, V, boot = F) -> res
    n.iterations 1255 
    S0 9.990423e-07 
    events 7 

Note that the example coincides with the previous one except for the values \(U_3\) and \(U_4\), which have been modified. In this case, the number of iterations to convergence is much larger. The inspection of the solution reveals that values \(X_1\) and \(X_2\) get practically all the probability mass, the remaining five \(X_i\)-values being almost ignored:

    > cbind(time = res$time, sampl.prob = res$biasf, prob.mass = res$density, 
    + survival = res$survival)
         time sampl.prob prob.mass survival
    [1,] 0.75    0.00052   0.49875  0.50125
    [2,] 1.05    0.00052   0.49875  0.00250
    [3,] 1.25    0.64824   0.00040  0.00210
    [4,] 1.50    0.77190   0.00034  0.00176
    [5,] 2.25    0.41937   0.00062  0.00114
    [6,] 2.40    0.45578   0.00057  0.00057
    [7,] 2.50    0.45578   0.00057  0.00000

This degeneration of the solution can be explained graphically. In Figure 1 a directed graph for the dataset is provided. The convention is that data point \(i\) directly communicates with data point \(j\) if \(U_i\leq X_j\leq V_i\). The graph is connected, but it is not strongly connected in the sense of (Xiao and Hudgens 2019). For example, individual 1 is not reachable from individual 4. The application of Proposition 2 in the referred paper indicates that, indeed, a NPMLE does not exist. The graph in Figure 1 can be generated by running the following code lines, for which the package igraph (Csardi and Nepusz 2023) within R is needed:

    > au <- outer(X, U, ">=")
    > av <- outer(X, V, "<=") 
    > J <- t(au*av)
    > diag(J) <- 0
    > library(igraph)
    > plot(graph.adjacency(J))
Figure 1: Graph for doubly truncated data. Toy example for which the Efron-Petrosian NPMLE does not exist.

The latter example somehow resembles the one in Section 3.1; here we have just added a sequence of right-truncation limits \(V_i\), \(1\leq i\leq n\). Indeed, there exists some relationship between the issue of holes and the non-existence or non-uniqueness of the NPMLE. Assume, for instance, that for a left-truncated sample one has \(\sum_{j=1}^n I(U_j\leq X_i)=1\) for some \(i\). Then, \(X_i\) is necessarily a hole (and \(X_i\) is the minimum \(X\)-value). Besides, the graph for the dataset is not strongly connected, because \(i\) is not reachable from any other point. Therefore, according to Proposition 1 in (Xiao and Hudgens 2019), the NPMLE of \(F\) does not exist or is not unique.

Finally we mention that, under double truncation, an example with multiple (indeed an infinite number of) NPMLE’s can be easily constructed. Assume that the \((X_i,U_i,V_i)\)’s are such that, for any couple of indexes \((i,j)\), \(I(U_i\leq X_j\leq V_i)\) is the Kronecker delta. Then, the conditional likelihood \(\mathcal{L}_{n,c}^{(dt)}\) is flat, and any distribution with mass concentrated at the \(X_i\)’s is a NPMLE for \(F\). Note that this example does not work for one-sided truncation, where the random sets \(\mathcal{B}\) are unbounded intervals.

Simulation study

In Section 3 several issues in the computation of the NPMLE under random truncation have been discussed. In this section we investigate through simulations how frequently such issues may appear. More specifically, for left-truncated data the addressed questions are when and where holes occur. This is interesting since, as mentioned, such holes imply a degeneration of the Lynden-Bell estimator. For doubly truncated data, simulation results on the possible non-existence and non-uniqueness of the Efron-Petrosian NPMLE are provided.

Holes with left-truncation

To simulate left-truncated data, pairs of values from \((U,X)\) are drawn and the truncation condition is assessed until a sample of size \(n\) is obtained. Recall that \(U \leq X\) is compulsory under left-truncation so, in the event that \(U > X\), the couple \((X,U)\) is dismissed. For each sample \((X_i,U_i)\), \(1\leq i\leq n\), a hole at \(X_i<\max_{1\leq j\leq n}X_j\) is identified by checking equality \(S_n(X_i)=0\), where \(S_n\) is the Lynden-Bell survival function. If this equality holds, there is a hole in the sample, and the hole is precisely located at \(X_i\). Table 1 provides the proportion of samples that contain a hole out of \(M=10000\) Monte Carlo trials. The sample size \(n\) ranges between \(20\) and \(100\). The target distribution is uniform on the unit interval, \(\mathcal{U}[0,1]\), while two truncation models are considered: \(\mathcal{U}[0,1]\) and \(Beta(2,1)\), where \(Beta(a,b)\) denotes the beta distribution with shape parameters \(a\) and \(b\). The percentage of samples with holes clearly decreases as the sample size increases. This was expected, according to Theorem 2 in (Strzalkowska-Kominiak and Stute 2010), which only requires the continuity of the involved variables.

Table 1: Empirical probability of holes along \(10000\) Monte Carlo trials. The left-truncation time is \(U\), the target variable is \(X\) and the sample size is \(n\).

\(U\)                  \(X\)                  \(n=20\)    \(n=30\)    \(n=50\)    \(n=75\)    \(n=100\)
\(\mathcal{U}[0,1]\)   \(\mathcal{U}[0,1]\)   \(0.0515\)  \(0.0356\)  \(0.0191\)  \(0.0133\)  \(0.0090\)
\(Beta(2,1)\)          \(\mathcal{U}[0,1]\)   \(0.1166\)  \(0.0895\)  \(0.0561\)  \(0.0437\)  \(0.0336\)
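As an illustration, one cell of Table 1 (the scenario \(U \sim \mathcal{U}[0,1]\), \(X \sim \mathcal{U}[0,1]\), \(n=20\)) can be approximated with the following sketch of ours; with continuous data, a hole occurs if and only if some risk set of size one appears before the largest \(X_i\):

    sim_hole <- function(n) {
      X <- U <- numeric(n)
      i <- 1
      while (i <= n) {                 # accept-reject: keep pairs with U <= X
        x <- runif(1); u <- runif(1)
        if (u <= x) { X[i] <- x; U[i] <- u; i <- i + 1 }
      }
      xk <- sort(X)
      nk <- sapply(xk, function(t) sum(X >= t & U <= t))
      any(nk[-n] == 1)                 # hole before the maximum X-value
    }
    mean(replicate(10000, sim_hole(20)))  # approx. 0.05, cf. Table 1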

An interesting question is how fast the frequency of holes converges to zero as the sample size increases. This was theoretically investigated by (Strzalkowska-Kominiak and Stute 2010) under right-truncation. Their Corollary 2 indicates that the convergence rate is faster than \(1/n\) under a support condition. To be explicit, let \([a_H,b_H]\) denote the support of a general continuous cdf \(H\), and let \(\Omega_n\) be the event ’the sample has a hole’. Then, Corollary 2 in (Strzalkowska-Kominiak and Stute 2010) ensures \(P(\Omega_n)=o(1/n)\) under \(a_G<a_F\) and \(b_G<b_F\), where \(F\) and \(G\) denote the cdfs of \(X\) and \(U\). Here we have conveniently re-written the support condition in the referred Corollary 2 (stated for right-truncation) for the left-truncation model, so the roles of \(F\) and \(G\) are interchanged.

In Figure 2 a Monte Carlo evaluation of \(P(\Omega_n)\) for three different models and several sample sizes is depicted; for comparison purposes, the values \(1/n\) are included. Let \(Exp(\lambda)\) denote the exponential distribution with mean \(1/\lambda\). When \(U \sim \mathcal{U}[0,1]\) and \(X \sim Exp(1)+0.25\) (Figure 2, left), the support condition in Corollary 2 of (Strzalkowska-Kominiak and Stute 2010) is satisfied. In this case, the empirical rate of convergence of \(P(\Omega_n)\) to zero, in green, is much faster than \(1/n\), in red. When the support condition does not hold, the rate of convergence is not necessarily faster than \(1/n\), as seen in the central panel of Figure 2, where \(U \sim Beta(2,1)\) and \(X \sim \mathcal{U}[0,1]\), so \(a_G=a_F=0\) and \(b_G = b_F = 1\). In this case, the rate of convergence is exactly \(1/n\); this can be easily confirmed by multiplying \(P(\Omega_n)\) by the sample size. The rate \(P(\Omega_n)=O(1/n)\) is also obtained for \(U \sim \mathcal{U}[0,2]\) and \(X \sim Exp(1)\), where \(a_G = a_F\) (right panel of Figure 2). These results are not surprising; indeed, the rate \(1/n\) is known to hold for \(P(\Omega_n)\) in the particular case \(F=G\), regardless of the support condition; see (Strzalkowska-Kominiak and Stute 2010), Corollary 3.

Figure 2: Empirical probability of holes along 10000 Monte Carlo trials (green dots) for different sample sizes \(n\), and the corresponding value of \(1/n\) (red dots). The simulation scenarios are: \(U \sim \mathcal{U}[0,1]\), \(X \sim Exp(1)+0.25\) (left); \(U \sim Beta(2,1)\) and \(X \sim \mathcal{U}[0,1]\) (middle); and \(U \sim \mathcal{U}[0,2]\) and \(X \sim Exp(1)\) (right).

(Strzalkowska-Kominiak and Stute 2010) also obtained an upper bound for \(P(\Omega_n)\) under right-truncation. For exponentially distributed target and truncation variables, the bound is simply computed as \[n \int_0^1 \left[ 1- \frac{1}{\alpha} \left( 1-u^{\alpha} \right) u^{1-\alpha} \right]^{n-1} du-1,\qquad(16)\]where \(\alpha\) is the probability of no truncation. For \(U \sim Exp(2)\) and \(X \sim Exp(1)\), the Monte Carlo version of \(P(\Omega_n)\) and the aforementioned upper bound are depicted in the left panel of Figure 3. From this figure it is seen that the bound is very accurate for all the considered sample sizes; this complements the numerical evidence provided by (Strzalkowska-Kominiak and Stute 2010). On the other hand, the rate of convergence of \(P(\Omega_n)\) in Figure 3, left, is again \(1/n\).

In the particular case \(F=G\), (Strzalkowska-Kominiak and Stute 2010) proved that \(P(\Omega_n)\) is distribution-free. In Figure 3, right, this is verified numerically. Values were drawn from \(Exp(1)\), \(\mathcal{U}[0,1]\), \(Weibull(2,1)\) and \(Beta(2,2)\) for both truncation and target distributions. Here, \(Weibull(a,b)\) stands for the Weibull distribution with shape parameter \(a\) and scale parameter \(b\). The proportion of samples with holes was plotted for each sample size; the empirical probabilities obtained from the four different models almost overlap, supporting the distribution-free property of \(P(\Omega_n)\). On the other hand, the convergence rate of \(P(\Omega_n)\) in Figure 3, right, is \(1/n\), which, as mentioned, agrees with Corollary 3 in (Strzalkowska-Kominiak and Stute 2010).

Regarding the location of the holes, in the performed simulations they generally appeared within the three smallest or the three largest \(X\)-values. More precisely, most of the times the hole was the smallest or the (second) largest value of the target variable, particularly for large sample sizes. For small sample sizes, however, holes also appeared among the central \(X\)-values, but always with a very low frequency.

Figure 3: Left: empirical probability of holes along 10000 Monte Carlo trials (green dots), theoretical upper bound (black dots) for several sample sizes \(n\), and the corresponding value of \(1/n\) (red dots); simulation scenario: \(U \sim Exp(2)\) and \(X \sim Exp(1)\). Right: empirical probability of holes along 10000 Monte Carlo trials for the special case where the truncation and target variables follow the same distribution: \(Exp(1)\) (red dots), \(\mathcal{U}[0,1]\) (green), \(Weibull(2,1)\) (blue), and \(Beta(2,2)\) (yellow dots).

Issues with doubly truncated data

In order to simulate doubly truncated data we consider interval sampling. The length of the sampling interval is \(\varsigma\). The target variable \(X\) is uniformly distributed on the unit interval. For \(\rho>0\), we consider the following truncation variables: \(U \sim (1+\varsigma)\,\mathcal{U}[0,1]^{\rho} - \varsigma\) and \(V = U + \varsigma\). Then, the supports of \(U\) and \(V\) are \([-\varsigma,1]\) and \([0,1+\varsigma]\), respectively. Triplets \((X,U,V)\) violating condition \(U\leq X\leq V\) are discarded. The function \(G(x)=P(U\leq x\leq V)\) is flat when \(\rho=1\), so the sampling probability is constant in such a case. For other choices of \(\rho\) the sampling bias is not negligible (de Uña-Álvarez 2023).
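A minimal sketch of this data-generating mechanism is as follows (the function name sim_dt is ours):

    # Simulate a doubly truncated sample under interval sampling
    sim_dt <- function(n, rho, vsig) {
      X <- U <- V <- numeric(n)
      i <- 1
      while (i <= n) {
        x <- runif(1)                              # target X ~ U[0,1]
        u <- (1 + vsig) * runif(1)^rho - vsig      # left-truncation time
        v <- u + vsig                              # right-truncation time
        if (u <= x && x <= v) { X[i] <- x; U[i] <- u; V[i] <- v; i <- i + 1 }
      }
      data.frame(X = X, U = U, V = V)
    }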

As mentioned in Section 3.3, (Xiao and Hudgens 2019) provided in their Proposition 1 a necessary and sufficient (NS) condition for the existence and uniqueness of the Efron-Petrosian NPMLE. That condition requires a strongly connected status for the graph generated by the data points \((X_i,U_i,V_i)\), \(1\leq i\leq n\). A necessary (N) condition for the graph to be strongly connected was reported in (de Uña-Álvarez, Moreira, and Crujeiras 2021), Appendix A.2. This just requires checking whether \(\min\{m_1,m_2\}>1\), where \(m_1=\min_{i}\sum_{j=1}^n I(U_j\leq X_i\leq V_j)\) and \(m_2=\min_{i}\sum_{j=1}^n I(U_i\leq X_j\leq V_i)\). Table 2 reports the proportion of simulated samples, among \(10000\) Monte Carlo trials, for which the necessary and the necessary and sufficient conditions for the existence and uniqueness of the NPMLE hold.
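The necessary condition can be checked with a few lines of R; the sketch below (cond_N is our name) can be combined with sim_dt above to approximate the N columns of Table 2:

    # Necessary condition: every X_i is covered by at least two truncation
    # intervals, and every interval [U_i, V_i] contains at least two X-values
    cond_N <- function(X, U, V) {
      m1 <- min(sapply(seq_along(X), function(i) sum(U <= X[i] & X[i] <= V)))
      m2 <- min(sapply(seq_along(X), function(i) sum(U[i] <= X & X <= V[i])))
      min(m1, m2) > 1
    }
    dat <- sim_dt(20, rho = 1, vsig = 1)
    cond_N(dat$X, dat$U, dat$V)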

From Table 2 it is seen that the proportion of samples for which conditions N or NS hold increases with the sample size; that is, the larger the sample size, the larger the probability for the NPMLE to exist and be unique. Moreover, the proportion of samples for which the necessary and the necessary and sufficient condition hold approach each other as the sample size increases. This is relevant for a quick assessment of the existence and uniqueness of the NPMLE in practice (verification of the NS condition is more computationally demanding). Another feature which is evident from Table 2 is that, for each fixed value of \(\rho\), the issues with the NPMLE appear less frequently when enlarging the sampling interval. This agrees with the intuition that a wider interval should result in a more robust sampling information. Table 2 also suggests a rate \(O(1/n)\) for the probability of having issues in the sample, at least when \(\rho\leq 2\).

Conditions under which the graph generated by the data is strongly connected almost surely as \(n\rightarrow \infty\) have been theoretically investigated; see (Xiao and Hudgens 2019), Proposition 3. Let \(H\), \(F\) and \(Q\) denote the cdfs of \(U\), \(X\) and \(V\). Besides continuity on these three distributions and convexity of their supports, such conditions impose \(-\infty < a_H \leq a_F < b_F \leq b_Q < \infty\) (so some boundedness is required for the supports), and \(P(V-U > \epsilon) = 1\) for some \(\epsilon>0\). Interestingly, the latter condition holds for interval sampling. Whatever the case, the scenarios simulated in this section fulfill the conditions in the referred Proposition 3. Therefore, results in Table 2 are aligned with the available theory, providing on the other hand a numerical support.

Table 2: Proportion of samples among \(10000\) Monte Carlo trials for which the necessary condition (N) and the necessary and sufficient condition (NS) for the existence and uniqueness of the Efron-Petrosian NPMLE hold. The sample size is \(n\), the length of the sampling interval is \(\varsigma\), and \(\rho\) is a shape parameter for the sampling probability \(G(x)\).

                                    \(n=20\)        \(n=30\)        \(n=50\)        \(n=75\)        \(n=100\)
                                    N      NS       N      NS       N      NS       N      NS       N      NS
\(\rho=1\)   \(\varsigma=1\)        0.907  0.895    0.934  0.930    0.963  0.962    0.974  0.973    0.982  0.982
             \(\varsigma=1/2\)      0.821  0.701    0.873  0.841    0.926  0.919    0.951  0.948    0.959  0.957
\(\rho=2\)   \(\varsigma=1\)        0.853  0.835    0.894  0.884    0.923  0.918    0.945  0.943    0.954  0.953
             \(\varsigma=1/2\)      0.751  0.593    0.820  0.767    0.878  0.862    0.917  0.908    0.930  0.926
\(\rho=6\)   \(\varsigma=1\)        0.796  0.754    0.829  0.808    0.860  0.848    0.886  0.877    0.900  0.893
             \(\varsigma=1/2\)      0.665  0.455    0.747  0.650    0.813  0.783    0.841  0.827    0.860  0.848

Real data illustrations

Channing House data

Channing House is a retirement center located in Palo Alto. The ages at death for members of this community have been frequently used to illustrate the issues of truncation and censoring (Klein and Moeschberger 2003). The target variable \(X\) is age at death. The left-truncation variable \(U\) is the age at entry into the retirement center; note that subjects who die at an early age are systematically excluded from the analysis. On the other hand, some ages are right-censored because the individual was still alive at the end of follow-up. The information is available from channing, package KMsurv (Klein, Moeschberger, and Yan 2012).

We consider the data corresponding to the \(n=97\) male residents of Channing House. The NPMLE of the survival function is the solid line in Figure 4; the result is impractical. In order to get an informative curve, the NPMLE of the conditional survival \(S(x|x_0)=P(X>x|X>x_0)\) with \(x_0=781\) months was computed, see Section 3.1; this is the dashed line in Figure 4. In this case, one can clearly see how the survival decreases along time, although there exists a small fraction of probability mass at infinity due to the right-censoring effects. The choice for \(x_0\) was suggested by the fact that there exists a unique hole for this dataset, and it is precisely located at \(781\) months. The results were obtained from the following code lines; the age at death was artificially increased by \(0.5\) months because the function survfit() is unable to cope with ties between left-truncation and survival times.

    > library(KMsurv)
    > data(channing)
    > library(survival)
    > res <- survfit(Surv(ageentry, age+.5, death) ~ 1, data = channing, 
    +      subset = {gender==1})
    > plot(res, xlim = c(720, 1153.5), ylim=c(0, 1), conf.int = F, 
    +      xlab = "age at death (months)", ylab = "survival probability")
    > cumprod(1 - res$n.event[-(1:2)]/res$n.risk[-(1:2)]) -> Scond
    > lines(res$time[-(1:2)], Scond, type = "s", lty = 2)
Figure 4: NPMLE of the survival function (solid line), and conditional NPMLE given survival beyond 781 months (dashed line), for the Channing House data, male cohort.

Childhood Cancer data

Consider all the children diagnosed with nervous system tumour between 1 January 1999 and 31 December 2003 in the region of North Portugal, according to the information provided by RORENO service, Instituto Português de Oncologia; see (de Uña-Álvarez, Moreira, and Crujeiras 2021) for further description. The sample size is \(n=94\), and the target variable \(X\) is the age at diagnosis (in days). The data are provided in ChildCancer, ICCC group 3, within DTDA package. The interval sampling results in double truncation on \(X\); the right-truncation time \(V\) is age (in days) at 31 December 2003, while \(U=V-1825\). In order to take the potential sampling bias into account, we consider the Efron-Petrosian NPMLE to evaluate \(F\), the cdf of \(X\).

Unfortunately, the Efron-Petrosian NPMLE has non-existence or non-uniqueness issues in this application. Specifically, as noted in (de Uña-Álvarez 2020), no \(X\)-value other than \(X_{94}=\max_{1\leq i\leq n}X_i\) falls in the truncation interval \([U_{94},V_{94}]=[5318,7143]\) and, therefore, the graph for this dataset is not strongly connected. The strongly connected subgraphs, for which existence and uniqueness of the NPMLE is ensured (Xiao and Hudgens 2019), can be easily obtained by using the package igraph as follows:

    > library(DTDA)
    > CC <- na.omit(ChildCancer[, 1:4])
    > U <- CC$U[CC$ICCGroup == 3]
    > V <- CC$V[CC$ICCGroup == 3]
    > X <- CC$X[CC$ICCGroup == 3]
    > au <- outer(X, U, ">=")
    > av <- outer(X, V, "<=")
    > J <- t(au*av)
    > library(igraph)
    > SCG <- clusters(graph.adjacency(J), mode = "strong")
    > SCG$membership
     [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
    [39] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
    [77] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2

From the output above it is seen that one gets a strongly connected graph when removing case \(94\) from the sample. The corresponding Efron-Petrosian NPMLE, which can be regarded as conditional on \(X<X_{94}=5438\), is reported in Figure 5. The ordinary empirical cdf is included for comparison purposes. The plot was generated by running the following code lines:

    > res <- shen(X[-94], U[-94], V[-94], boot = F)
    > plot(res$time, res$cumulative.df, type = "s", xlab = "age at diagnosis (days)", 
    +      ylab = "cumulative probability")
    > lines(res$time, ecdf(res$time)(res$time), type = "s", lty = 2)
Figure 5: Efron-Petrosian NPMLE (solid line) and ordinary empirical cdf (dashed line) for the Childhood Cancer data.

It becomes clear from Figure 5 that double truncation induces some oversampling of relatively large ages at diagnosis; this is corrected by the Efron-Petrosian estimator. From a formal viewpoint, the null hypothesis of no sampling bias (that is, constant \(G\)) can be tested by comparing the Efron-Petrosian NPMLE to the ordinary empirical cdf of the \(X_i\)’s. The Kolmogorov-Smirnov type test statistic proposed by (de Uña-Álvarez 2023) was applied; the \(P\)-value based on 500 bootstrap resamples was \(0.114\), which entails some evidence against the null hypothesis. This is in good agreement with Figure 5.

Main conclusions

In this paper the issues of censoring and truncation, which are ubiquitous in the analysis of survival data, have been revisited. Focusing on the possible degeneration, non-uniqueness or non-existence of the NPMLE, illustrative examples in R have been provided. The contents are therefore useful for a better understanding of the problems underlying estimation from survival data. We have indicated how these problems can be detected and, when possible, solved. The included simulation studies give new insights into how frequently the aforementioned issues may appear. It has been seen that the sampling effort, as well as the particular truncation pattern, largely determines the probability of obtaining problematic samples.

Throughout the paper we have considered nonparametric estimation of the lifetime distribution. Of course, corrections for censoring and truncation are also important in regression analysis, among other settings. Cox partial likelihood (Cox 1972) allows for left-truncation and right-censoring by a proper definition of the risk sets. The situation is not so automatic in other sampling designs. For instance, weighted approaches for Cox regression have been proposed under double truncation, where the weights are given by the jumps of the (marginal) Efron-Petrosian NPMLE (de Uña-Álvarez, Moreira, and Crujeiras 2021). The issue of efficiency in the estimation of the regression parameters remains, however, generally unsolved at the time of writing.

This manuscript is not intended to be exhaustive regarding the sampling issues that affect survival data, nor to provide access to all the related literature. That would have implied, by force, a much longer and deeper review. Rather, we have pursued an intuitive, quick and practical guide to nonparametric estimation techniques under censoring and/or truncation, which are the most common issues when dealing with lifetimes. We sincerely hope to have achieved this goal.

Acknowledgments

Work supported by the Grant PID2020-118101GB-I00, Ministerio de Ciencia e Innovación (MCIN/ AEI /10.13039/501100011033).

About the authors

Jacobo de Uña-Álvarez is Professor at the Department of Statistics and Operations Research, Universidade de Vigo. He is affiliated with the SiDOR research group and CINBIO at the same institution. His educational background includes a BSc in Mathematics (1995) and a PhD in Mathematical Statistics (1998), both from the Universidade de Santiago de Compostela. Jacobo’s main research area is nonparametric statistics and goodness-of-fit tests, and their application to Survival Analysis and high-dimensional data, including multiple testing.
Adrián Lago is a predoctoral student at the Department of Statistics and Operations Research, Universidade de Vigo. He is affiliated with the SiDOR research group and CINBIO at the same institution. His research contract is supported by Ministerio de Ciencia e Innovación, Grant PRE2021-096943. He received a BSc in Mathematics from the Universidade de Santiago de Compostela in 2020, and a MSc in Statistics from the Universidade de Vigo in 2022. Adrián’s PhD project is focused on randomly truncated data. He has already finished and submitted a paper on the two-sample problem under left-truncation.

References

Breslow, N., and J. Crowley. 1974. “A Large Sample Study of the Life Table and Product Limit Estimates Under Random Censorship.” The Annals of Statistics 2 (3): 437–53.
Cox, D. R. 1972. “Regression Models and Life-Tables.” Journal of the Royal Statistical Society. Series B (Methodological) 34 (2): 187–220.
Cox, D. R., and D. Oakes. 1984. Analysis of Survival Data. Chapman; Hall/CRC.
Csardi, G., and T. Nepusz. 2023. igraph: Network Analysis and Visualization. https://CRAN.R-project.org/package=igraph.
de Uña-Álvarez, J. 2020. “Nonparametric Estimation of the Cumulative Incidences of Competing Risks Under Double Truncation.” Biometrical Journal 62 (3): 852–67.
———. 2023. “Testing for an Ignorable Sampling Bias Under Random Double Truncation.” arXiv:2301.03698.
de Uña-Álvarez, J., C. Moreira, and R. M. Crujeiras. 2021. The Statistical Analysis of Doubly Truncated Data: With Applications in R. Wiley.
Efron, B. 1967. “The Two-Sample Problem with Censored Data.” Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability 4: 831–52.
Efron, B., and V. Petrosian. 1999. “Nonparametric Methods for Doubly Truncated Data.” Journal of the American Statistical Association 94 (447): 824–34.
Fleming, T. R., and D. P. Harrington. 1981. Counting Processes and Survival Analysis. Wiley.
Gentleman, R., and A. Vandal. 2022. Icens: NPMLE for Censored and Truncated Data. https://bioconductor.org/packages/release/bioc/html/Icens.html.
Gómez, G., M. L. Calle, R. Oller, and K. Langohr. 2009. “Tutorial on Methods for Interval-Censored Data and Their Implementation in R.” Statistical Modelling 9 (4): 259–97.
Heuchenne, C., J. de Uña-Álvarez, and G. Laurent. 2020. “Estimation from Cross-Sectional Data Under a Semiparametric Truncation Model.” Biometrika 107 (2): 449–65.
Kalbfleisch, J. D., and R. L. Prentice. 1980. The Statistical Analysis of Failure Time Data. Wiley.
Kaplan, E. L., and P. Meier. 1958. “Nonparametric Estimation from Incomplete Observations.” Journal of the American Statistical Association 53 (282): 457–81.
Klein, J. P., and M. L. Moeschberger. 2003. Survival Analysis: Techniques for Censored and Truncated Data. (2nd Edition). Springer.
Klein, J. P., M. L. Moeschberger, and J. Yan. 2012. KMsurv: Data Sets from Klein and Moeschberger (1997), Survival Analysis. https://CRAN.R-project.org/package=KMsurv.
Lawless, J. F. 1982. Statistical Models and Methods for Lifetime Data. Wiley.
Lynden-Bell, D. 1971. “A Method of Allowing for Known Observational Selection in Small Samples Applied to 3CR Quasars.” Monthly Notices of the Royal Astronomical Society 155 (1): 95–118.
Moreira, C., J. de Uña-Álvarez, and R. M. Crujeiras. 2022. DTDA: Doubly Truncated Data Analysis. https://CRAN.R-project.org/package=DTDA.
R Core Team. 2022. R: A Language and Environment for Statistical Computing. Vienna, Austria: R Foundation for Statistical Computing. https://www.R-project.org/.
Satten, G. A., and S. Datta. 2001. “The Kaplan–Meier Estimator as an Inverse-Probability-of-Censoring Weighted Average.” The American Statistician 55 (3): 207–10.
Shen, P. S. 2010. “Nonparametric Analysis of Doubly Truncated Data.” Annals of the Institute of Statistical Mathematics 62 (5): 835–53.
Strzalkowska-Kominiak, E., and W. Stute. 2010. “On the Probability of Holes in Truncated Samples.” Journal of Statistical Planning and Inference 140 (6): 1519–28.
Sun, J. 2006. The Statistical Analysis of Interval-Censored Failure Time Data. Springer.
Therneau, T. M. 2022. survival: Survival Analysis. https://CRAN.R-project.org/package=survival.
Turnbull, B. W. 1974. “Nonparametric Estimation of a Survivorship Function with Doubly Censored Data.” Journal of the American Statistical Association 69 (345): 169–73.
———. 1976. “The Empirical Distribution Function with Arbitrarily Grouped, Censored and Truncated Data.” Journal of the Royal Statistical Society. Series B (Methodological) 38 (3): 290–95.
Wang, M. C. 1991. “Nonparametric Estimation from Cross-Sectional Survival Data.” Journal of the American Statistical Association 86 (413): 130–43.
Woodroofe, M. 1985. “Estimating a Distribution Function with Truncated Data.” The Annals of Statistics 13 (1): 163–77.
Xiao, J., and M. G. Hudgens. 2019. “On Nonparametric Maximum Likelihood Estimation with Double Truncation.” Biometrika 106 (4): 989–96.
