Examples of using R functions related to emplik (empirical likelihood) package (V1.0) Example 1. Find the upper and lower confidence limit of a Wilks confidence interval related to the Kaplan-Meier estimator. We can use the function el.cen.EM from the emplik package to accomplish this. In fact el.cen.EM is designed to test the hypothesis (via empirical likelihood ratio test) Ho: \int g(t) dF(t) = mu; vs. Ha: \int g(t) dF(t) != mu where g(t) is a user supplied function. Depending on the g(t) we supply, this hypothesis can be the test about the Kaplan-Meier mean [when g(t) = t], or about the survival probability at time t0 [when g(t) = I[t <= t0] ] or many others. Let us see a real example: we use the myeloma data set that comes with emplik package data(myeloma) survtimes <- myeloma[,1] censtatus <- myeloma[,2] We shall test and construct confidence interval for F(10) (Failure probability at 10 month) based on Kaplan-Meier, so define myfun1 <- function(t){ as.numeric(t <= 10) } Now a test based on the empirical likelihood ratio for Ho: F(10) = 0.2 is el.cen.EM(fun=myfun1, x=survtimes, d=censtatus, mu=0.2) or el.cen.EM2(fun=myfun1, x=survtimes, d=censtatus, mu=0.2) among the output, you see a Pval (that is the P-value) of 0.3103897 so the conlusion is "not to reject the null hypothesis of F(10)=0.2" since the P-value is larger than 0.05. If you want the (NPMLE) estimate of F(10), look in the output for "funMLE" You will see a MLE of 0.2526136 (this is just the Kaplan-Meier at t=10) The function el.cen.EM2 in the emplik package is very similar. The function el.cen.EM is designed for a single parameter, and el.cen.EM2 can handle multiple parameters. el.cen.EM may be a bit faster, while el.cen.EM2 may be a bit more robust. For the subtle differences, please see the help pages in R. Now we can further find the 95% confidence interval for F(10). Either by trial and error (testing different mu values until you get p-value of 0.05) Or use the function findUL( ). For this purpose we need to define a function that just returns the -2LLR. and the first input of this function must be the value to be tested. myloglikR1 <- function(mu0, x, d) { myfun1 <- function(t){as.numeric(t<=10)} el.cen.EM(fun=myfun1, x=x, d=d, mu=mu0) } The real search for the lower(upper) confidence bound is done by the function findUL( ): > findUL(fun=myloglikR1, MLE=0.2526, x=survtimes, d=censtatus) $Low [1] 0.1566809 $Up [1] 0.3684038 $FstepL [1] 1e-10 $FstepU [1] 1e-10 $Lvalue [1] 3.84 $Uvalue [1] 3.84 So the 95% confidence interval for F(10) is [0.1566809, 0.3684038] If you want the 90% confidence interval instead, do this findUL(fun=myloglikR1, MLE=0.2526, x=survtimes, d=censtatus, level=2.705543) ======================================================================== In the above example, the Wald confidence interval can at least be computed easily, with the Greenwood formula to estimate the variance. In the next example it is almost impossible to get a Wald confidence interval. But the Wilks confidence interval is as easy as in example one. Example 2. Find the upper and lower confidence limit of the residual median, where the estimator is based on the Kaplan-Meier estimator. [residual median? or median residual?] First the definition: The residual median at a given time t0 is the number "theta" that solve the equation 1-F(t0 + theta ) ---------------- = 0.5 1-F(t0) The estimator of theta can be obtained by (a) replace F by the Kaplan-Meier in the above and (b) solve for theta. Notice the variance of the estimator of theta is very complicated, and even if you get a theoretical expression for it, it is not very useful, due to the fact that it involves density function, a quantity hard to estimate. So, let us see how Wilks proceed. To use the Wilks method, we shall re-cast the problem into a testing hypothesis problem. Notice that a theta solves the above equation iff it solves \int g_{theta} (t) dF(t) = 0.5 where g_{theta} (t) = I[t <= t0+theta] - 0.5*I[t <= t0] (simple algebra manipulations) and recall this form of hypothesis can be handled by the el.cen.EM function. Remark: due to the fact that the empirical quantile function is not continuous, it is more desirable to smooth the above indicator function in the definition of g_{theta}. R-code: library(survival) data(cancer) library(emplik) cancertimes <- cancer$time cancerstatus <- cancer$status -1 This estimates the median and mean residual time at t0=365.25 days MMRtime(x=cancertimes, d=cancerstatus, age=365.25) $MeanResidual [1] 275.9997 $MedianResidual [1] 258.75 Now the test that theta = 258 mygfun22 <- function(s, age, Mdage) {as.numeric(s<=(age+Mdage))-0.5*as.numeric(s<=age)} el.cen.EM(x=cancertimes, d=cancerstatus, fun=mygfun22, mu=0.5, age=365.25, Mdage=258)$"-2LLR" If we are lazy, we can just repeat the last line but change 258 to something else, until you get "-2LLR" as close to as possible (but not exceed 3.84) to find the 95% confidence interval. Or do it automatically as follows. To find the upper and lower confidence limit by findL/findU, we need to write a function that returns the "-2LLR" and the first input variable is the value to be tested. myloglikR2 <- function(mu0, x, d) { mygfun22 <- function(s, age=365.25, Mdage=mu0){as.numeric(s<=(age+Mdage))-0.5*as.numeric(s<=age)} el.cen.EM(x=x, d=d, fun=mygfun22, mu=0.5)$"-2LLR" } Finally, we may do this (but it took too long...slow search algorithm the default search step is too small.) takes too long: findL(fun=myloglikR2, MLE=258, x=time, d=status, level=2.705543) Better to do this (with bigger steps, search is a bit faster) findL(step=5,fun=myloglikR2, MLE=258, x=time, d=status, level=2.705543) [1] 184.75000000 0.00000005 2.50369332 findU(step=5,fun=myloglikR2, MLE=258, x=time, d=status, level=2.705543) [1] 321.75000000 0.00000005 2.42779336 So, The 90% confidence interval is [ 184.750, 321.750 ]. If the step is too large, the hypothesis to be tested may be out of the range for el.cen.EM to handle and may stop computing or too slow. (here use el.cen.EM2 may be more robust) Notice the final level reached is 2.503 or 2.427 and somewhat away (and lower) from 2.7055. This is normal, since this is a discrete case when testing the median. (or any quantiles) ============================================================================== Let us now try an example for the Cox model. This is similar to the example on page58 from the book "Modeling survival data" by Therneau and Grambsch. The data set involved in "pbc". > library(survival) > coxph(Surv(time,status==2)~age+edema+log(bili)+log(albumin)+log(protime), data=pbc)$loglik [1] -866.9573 -751.4697 We see the full model loglik value is -751.4697 (the max value). We then define a function that will offset the slope for log(protime) and return the -2LLR. > pbcloglik <- function(beta, max, pbc){ temp <- coxph(Surv(time,status==2)~age+edema+log(bili)+log(albumin)+offset(beta*log(protime)), data=pbc) loglikratio <- 2*( max - temp$loglik[2]) list("-2LLR"=loglikratio) } Now we are ready to find the confidence interval. You find the MLE of beta for log(protime) if you read the output of coxph( ) above, instead of looking only at loglik. > findUL(fun=pbcloglik, MLE=2.38, max=-751.4697, pbc=pbc) $Low [1] 0.8106858 $Up [1] 3.825281 $FstepL [1] 1e-09 $FstepU [1] 1e-09 $Lvalue [1] 3.84 $Uvalue [1] 3.84 So, the 95% Wilks confidence interval for beta1 is [0.8106858, 3.825281] ========================================================================