Example of using R functions related to emplik (empirical likelihood) package Example 1. Find the upper and lower confidence limit of a Wilks confidence interval related to the Kaplan-Meier estimator We can use the function el.cen.EM from the emplik package to accomplish this. In fact el.cen.EM is designed to test the hypothesis (via likelihood ratio test) Ho: \int g(t) dF(t) = mu; vs. Ha: \int g(t) dF(t) != mu where g(t) is a user supplied function. Depending on the g(t) we supply, this hypothesis can be the test about the Kaplan-Meier mean [when g(t) = t], or about the survival probability at time t0 [when g(t) = I[t <= t0] ] or many others. Let us see a real example: we use the myeloma data set that comes with emplik package data(myeloma) survtimes <- myeloma[,1] censtatus <- myeloma[,2] We shall test and construct confidence interval for F(10), so define myfun1 <- function(t){ as.numeric(t <= 10) } Now a test based on the empirical likelihood ratio for Ho: F(10) = 0.2 el.cen.EM(fun=myfun1, x=survtimes, d=censtatus, mu=0.2) among the output, you see a Pval (that is the P-value) of 0.3103897 so the conlusion is "not to reject the null hypothesis of F(10)=0.2". If you want the (NPMLE) estimate of F(10), look in the output for "funMLE" You will see a MLE of 0.2526136 (this is just the Kaplan-Meier at t=10) There is another function el.cen.EM2 in the emplik package which is very similar. The function el.cen.EM is designed for a single parameter, and el.cen.EM2 can handle multiple parameters. el.cen.EM may be a bit faster, while el.cen.EM2 may be a bit more robust. For the subtle differences, please see the help pages in R. Now we can further find the 95% confidence interval for F(10). For this purpose we need to define a function that just returns the -2LLR. The first input of this function must be the value to be tested. myloglikR1 <- function(mu0, x, d) { myfun1 <- function(t){as.numeric(t<=10)} el.cen.EM(fun=myfun1, x=x, d=d, mu=mu0)$"-2LLR" } The real search for the lower(upper) confidence bound is done by the function findL (findU) findL(fun=myloglikR1, MLE=0.25, x=survtimes, d=censtatus) [1] 0.1566809230 0.0000000001 3.8399999183 findU(fun=myloglikR1, MLE=0.25, x=survtimes, d=censtatus) [1] 0.3684037790 0.0000000001 3.8399999664 There are 3 output values, the first one is the lower(upper) confidence bound, the second/third is the final step value and the final -2LLR value. So the 95% confidence interval for F(10) is [0.156680923, 0.368403779] If you want the 90% confidence interval instead, do this findL(fun=myloglikR1, MLE=0.25, x=survtimes, d=censtatus, level=2.705543) findU(fun=myloglikR1, MLE=0.25, x=survtimes, d=censtatus, level=2.705543) Finally, the functions findL and findU and findUL are here: (not inside the emplik package yet) findU <- function(step=0.1, initStep=0, fun, MLE=0, level=3.84, ...) { value <- 0 Ubeta <- MLE + initStep for( i in 1:8 ) { while(value < level) { Ubeta <- Ubeta + step value <- fun(Ubeta, ... ) } Ubeta <- Ubeta - step step <- step/10 value <- fun(Ubeta, ... ) } return( c(Ubeta, step, value) ) } > findL function(step=0.1, initStep=0, fun, MLE=0, level=3.84, ...) { #### This function try to find the 1 dim confidence interval #### (the lower confidence limit only) #### the fun should be the -2loglik function with 1 parameter #### (the other parameters needs to be maxed over) #### The initStep is provided to save some initial steps, i.e. if you know #### approximate where the confidence bound is, (how far it is from the MLE) #### then you can start search from there, instead of starting from MLE. #### You should always use a lower bound for the initStep (a conservative value) #### as the search is from inside the confidence interval outwards. value <- 0 Lbeta <- MLE - initStep for( i in 1:8 ) { while(value < level) { Lbeta <- Lbeta - step value <- fun(Lbeta, ... ) } Lbeta <- Lbeta + step step <- step/10 value <- fun(Lbeta, ... ) } return( c(Lbeta, step, value) ) } And a code that combine the above two to find both upper and lower confidence bound: findUL <- function(step=0.1, initStep=0, fun, MLE, level=3.84, ...) { #### This function trys to find the 1 dim confidence interval #### (the lower/upper confidence limit). #### The fun should be the -2loglik function with 1 parameter #### (the other parameters needs to be maxed over) #### The initStep is provided to save some initial steps, i.e. if you know #### approximate where the confidence bounds are, (how far it is from the MLE) #### then you can start search from there, instead of starting from MLE. #### You should always use a conservative value for the initStep #### as the search is from inside the confidence interval outwards. value <- 0 step1 <- step Lbeta <- MLE - initStep for( i in 1:8 ) { while(value < level) { Lbeta <- Lbeta - step1 value <- fun(Lbeta, ... ) } Lbeta <- Lbeta + step1 step1 <- step1/10 value <- fun(Lbeta, ... ) } value1 <- value value <- 0 Ubeta <- MLE + initStep for( i in 1:8 ) { while(value < level) { Ubeta <- Ubeta + step value <- fun(Ubeta, ... ) } Ubeta <- Ubeta - step step <- step/10 value <- fun(Ubeta, ... ) } return( list(Low=Lbeta, Up=Ubeta, FstepL=step1, FstepU=step, Lvalue=value1, Uvalue=value) ) } ======================================================================== In the above example, the Wald confidence interval can at least be computed easily, with the Greenwood formula to estimate the variance. The next example then is almost impossible to get a Wald confidence interval. But the Wilks confidence interval is as easy as in example one. Example 2. Find the upper and lower confidence limit of the residual median, where the estimator is based on the Kaplan-Meier estimator. [residual median? or median residual?] First the definition: The residual median at a given time t0 is the number "theta" that solve the equation 1-F(t0 + theta ) ---------------- = 0.5 1-F(t0) The estimator of theta can be obtained by (a) replace F by the Kaplan-Meier in the above and (b) solve for theta. Notice the variance of the estimator of theta is very complicated, and even if you get a theoretical expression for it, it is of not much use, due to the fact that it involves density function, a quantity hard to estimate. To use the Wilks method, we shall re-cast the problem into a testing hypothesis problem. Notice that a theta solves the above equation iff it solves \int g_{theta} (t) dF(t) = 0.5 where g_{theta} (t) = I[t <= t0+theta] - 0.5*I[t <= t0] (simple algebraic manipulations) and recall this hypothesis can be handled by the el.cen.EM function. R-code: library(survival) data(cancer) library(emplik) cancertimes <- cancer$time cancerstatus <- cancer$status -1 This estimates the median and mean residual time at t0=365.25 days MMRtime(x=cancertimes, d=cancerstatus, age=365.25)$MeanResidual [1] 275.9997 $MedianResidual [1] 258.75 Now the test that theta = 258 mygfun22 <- function(s, age, Mdage) {as.numeric(s<=(age+Mdage))-0.5*as.numeric(s<=age)} el.cen.EM(x=cancertimes, d=cancerstatus, fun=mygfun22, mu=0.5, age=365.25, Mdage=258)$"-2LLR" If we are lazy, we can just repeat the last line but change 258 to something else, until you get "-2LLR" as close to as possible (but not exceed 3.84) to find the 95% confidence interval. Or do it automatically as follows. To find the upper and lower confidence limit by findL/findU, we need to write a function that returns the "-2LLR" and the first input variable is the value to be tested. myloglikR2 <- function(mu0, x, d) { mygfun22 <- function(s, age=365.25, Mdage=mu0){as.numeric(s<=(age+Mdage))-0.5*as.numeric(s<=age)} el.cen.EM(x=x, d=d, fun=mygfun22, mu=0.5)$"-2LLR" } Finally, we may do this (but it took too long...slow search algorithm the default search step is too small.) takes too long: findL(fun=myloglikR2, MLE=258, x=cancertimeis, d=cancerstatus, level=2.705543) Better to do this (with bigger steps, search is a bit faster) findL(step=5,fun=myloglikR2, MLE=258, x=cancertimes, d=cancerstatus, level=2.705543) [1] 184.75000000 0.00000005 2.50369332 findU(step=5,fun=myloglikR2, MLE=258, x=cancertimes, d=cancerstatus, level=2.705543) [1] 321.75000000 0.00000005 2.42779336 So, The 90% confidence interval is [ 184.750, 321.750 ]. If the step is too large, the hypothesis to be tested may be out of the range for el.cen.EM to handle and may stop computing or too slow. (here use el.cen.EM2 may be more robust) Notice the final level reached is 2.503 or 2.427 and somewhat away from 2.7055. THis is normal, since this is a discrete case when testing the median. (or any quantiles) ======================================================================== Example 3: The folowing is another example of how the code findL and findU works with parametric MLE: (the parametric regression of survreg( ) and getting the confidence interval for the slope in the regression) > findL(fun=loglikfun, MLE=-1.255066, y=myel$dur, d=myel$status, x1=myel$treat, x2=myel$renal) [1] -2.7499611050 0.0000000001 3.8399999981 So, the lower 95% confidence bound is -2.74996 The two functions involved: findL is general (as above), loglikfun is data specific. The MLE value can be obtained by first run a survreg, and getting the estimate, it is not critical, any value near -1.255 will work. > loglikfun function(beta, y, d, x1, x2) { loglikmax <- survreg(Surv(y,d)~ x1 + x2)$loglik[2] loglikH0 <- survreg(Surv(y,d)~ offset(beta*x1)+ x2)$loglik[2] return( -2*(loglikH0 - loglikmax) ) } Assumeing of course the data frame myel is already loaded there. ======================================================================== Now getting the interval for both regression slopes: > findU(fun=loglikfun2, MLE=-4.18, level=2.7, y=myel$dur, d=myel$status, x1=myel$treat, x2=myel$renal) [1] -3.1652483440 0.0000000001 2.6999999991 > findL(fun=loglikfun2, MLE=-4.18, level=2.7, y=myel$dur, d=myel$status, x1=myel$treat, x2=myel$renal) [1] -5.2571687790 0.0000000001 2.6999999965 > > findU(fun=loglikfun, MLE=-1.2, level=2.7, y=myel$dur, d=myel$status, x1=myel$treat, x2=myel$renal) [1] -0.2800085190 0.0000000001 2.6999999965 > findL(fun=loglikfun, MLE=-1.2, level=2.7, y=myel$dur, d=myel$status, x1=myel$treat, x2=myel$renal) [1] -2.4572724310 0.0000000001 2.6999999991 > > loglikfun function(beta0, y, d, x1, x2) { loglikmax <- survreg(Surv(y,d)~ x1 + x2)$loglik[2] loglikH0 <- survreg(Surv(y,d)~ offset(beta0*x1)+ x2)$loglik[2] return( -2*(loglikH0 - loglikmax) ) } > > loglikfun2 function(beta0, y, d, x1, x2) { loglikmax <- survreg(Surv(y,d)~ x1 + x2)$loglik[2] loglikH0 <- survreg(Surv(y,d)~ x1 + offset(beta0*x2))$loglik[2] return( -2*(loglikH0 - loglikmax) ) } > ======================================================================= Example 4. Test the equality of two medians with right censored data. The emplik package contains a function ROCnp( ) that can be used to test the equality of two medians from two independent samples. library(survival) times1 <- cancer$time[cancer$sex==1] status1 <- cancer$status[cancer$sex==1] -1 times2 <- cancer$time[cancer$sex==2] status2 <- cancer$status[cancer$sex==2] -1 ROCnp(t1=times1, d1=status1, t2=times2, d2=status2, b0=0.5, t0=0.5)$-2LLR [1] 11.02725 $cstar [1] 312.105 which lead us to reject the null hypothesis of equal median ( -2LLR larger than 3.84) If we want to construct a confidence interval for the RATIO of the two medians, we need to work harder. (no existing code) But the idea is similar: Let us test the hypothesis that m1/m2 = 0.6 where m1 is the median for male and m2 is the median for female. This is same as testing m1 = 0.6*m2 = a for some a, or m1=a, 0.6*m2=a; or m1=a, m2=a/0.6 . We shall modify the code of ROCnp to accomplish this. myROCnp(t1, d1, t2, d2, b0=0.5, t0=0.5) > myROCnp function (t1, d1, t2, d2, b0, t0) { if (length(b0) != 1) stop("check length of b0") if (length(t0) != 1) stop("check length of t0") if (b0 >= 1 | b0 <= 0) stop("check the value of b0") if (t0 >= 1 | t0 <= 0) stop("check the value of t0") tempnp2 <- WKM(x = t2, d = d2) place2 <- sum(tempnp2$surv >= t0) c2 <- tempnp2$times[place2] tempnp1 <- WKM(x = t1, d = d1) place1 <- sum(tempnp1$surv > b0) c1 <- tempnp1$times[place1] if (c2 <= c1) c1 <- tempnp1$times[place1 + 1] llr <- function(const, t1, d1, t2, d2, b0, t0) { npllik1 <- el.cen.EM2(x = t1, d = d1, fun = function(x, theta) { as.numeric(x <= theta) }, mu = 1 - b0, theta = const)$"-2LLR" npllik2 <- el.cen.EM2(x = t2, d = d2, fun = function(x, theta) { as.numeric(x <= theta/0.6) #### here is the ratio }, mu = 1 - t0, theta = const)$"-2LLR" return(npllik1 + npllik2) } temp <- optimize(f = llr, lower = min(c2, c1), upper = max(c2, c1), t1 = t1, d1 = d1, t2 = t2, d2 = d2, b0 = b0, t0 = t0) cstar <- temp$minimum val <- temp$objective list(-2LLR = val, cstar = cstar) } ==================================================================== Due to the discreteness of median, the optimize( ) function may not always find the min. An exhaust search can be done: [slower but always find the min] ROCnp2 <- function (t1, d1, t2, d2, b0, t0) { if (length(b0) != 1) stop("check length of b0") if (length(t0) != 1) stop("check length of t0") if (b0 >= 1 | b0 <= 0) stop("check the value of b0") if (t0 >= 1 | t0 <= 0) stop("check the value of t0") tempnp2 <- WKM(x = t2, d = d2) place2 <- sum(tempnp2$surv >= t0) c2 <- tempnp2$times[place2] tempnp1 <- WKM(x = t1, d = d1) place1 <- sum(tempnp1$surv > b0) c1 <- tempnp1$times[place1] if (c2 <= c1) {c1 <- tempnp1$times[place1 + 2] c2 <- tempnp2$times[place2 - 1] } llr <- function(const, t1, d1, t2, d2, b0, t0) { npllik1 <- el.cen.EM(x = t1, d = d1, fun = function(x, theta) { as.numeric(x <= theta) }, mu = 1 - b0, theta = const)$"-2LLR" npllik2 <- el.cen.EM(x = t2, d = d2, fun = function(x, theta) { as.numeric(x <= theta) }, mu = 1 - t0, theta = const)$"-2LLR" return(npllik1 + npllik2) } lower <- min(c1, c2) upper <- max(c1, c2) timesALL <- c(tempnp2$times, tempnp1$times) midpts <- timesALL[(timesALL>lower) & (timesALL ROCnp3(t1=times1, d1=status1, t2=times2, d2=status2, t0=200) $-2LLR [1] 8.913334$cstar [1] 0.3200662 $Prob1 [1] 0.3926928$Prob2 [1] 0.2054065 ==================================================================== Some thoughts when testing the ratio of two probabilities or medians. If the actual ratio is further away from 1 then the value being tested (as in Ho), then we may search within the two prob/medians. Since we have to push the two together to make them equal or having the ratio stipulated in Ho. (i.e. if the two MLE's has ratio 0.5, and we are testing if the ratio is equal to 0.6; Or if the two MLE's has ratio 1.4 and we are testing Ho ratio = 1.2) We know where to search the common value. As an example, in the above ROCnp3 example, the actual ratio of probability is 0.2054/0.3927=0.523, and our Ho stipulates a ratio of 1. Therefore we need only to search with the two MLEs and the final common value is 0.320. When the value tested (in Ho) is further away from 1 compared to the ratio of MLE's, then we must search outside the interval formed by two MLE's. How much outside, will depends on the actual value to be tested. If we are testing ratio equal to 2, then we should search from value a and value b, where a = larger of MLEs/2, b= smaller of MLEs x2. etc. In any case, the search of probability should be within [0, 1] interval. =========================================================================== A final note: When define median, it may be better to use a "smoothed" indicator function. Due to discreteness nature of the Kaplan-Meier estimator. The median may be non-unique. (even after the smoothing) Here is the indicator function and 2 smoothed version of it. myfun6 <- function(x, theta) {as.numeric(x <= theta)} myfun7 <- function(x, theta=0, eps=0.2) { if(eps <= 0) stop("eps must > 0") u <- (x-theta)/eps return( pmax(0, pmin(1-u, 1)) ) } myfun5 <- function(x, theta=0, eps=0.1) { if(eps <= 0) stop("eps must > 0") u <- (x-theta)*sqrt(5)/eps INDE <- (u < sqrt(5)) & (u > -sqrt(5)) u[u >= sqrt(5)] <- 0 u[u <= -sqrt(5)] <- 1 y <- 0.5 - (u - (u)^3/15)*3/(4*sqrt(5)) u[ INDE ] <- y[ INDE ] return(u) }