2011: For the Empirical Likelihood ratio (Wilks) theorem for hazard integrals with censored data: show the two modifications: (a) Verify that the null hypothesis \int g(t) d H(t) = \theta_0, can be replaced by \int g(t) d H(t) = \theta_0 + o(1/sqrt n) and the chi square limit for the -2 likelihood ratio still hold under H_0. (b) If we replace the null hypothesis by \int g(t) dH(t) = \theta_0 + ( C / \sqrt n) then the limit of the -2 log likelihood ratio has limit of a non central chi square distribution. Identify this non-centrality parameter. You may want to consult the note 2.1. TO make things simpler, you may assume just one constraint. (So the resulting chi square has df=1) ============================================================================== Is there a simple proof for the NPMLE of Kaplan-Meier? like Owen 2001's proof of the sampling distribution being NPMLE? Added 2010 Spring: Exercise: Without extra constraint, the maximization of the nonparametric likelihood function with right censored data, is solved by Kaplan-Meier (1958). Try to solve it yourself, without consult the paper. There is an explicit solution given by Kaplan-Meier, there is also a recursive solution. For 2007 Spring Jan. (0) Use a data set with more than 100 observations. Use SAS proc lifetest (version 9) obtain the confidence band for the Kaplan-Meier curve and point-wise confidence interval. Compare. Try to move the upper and lower time boundary for the confidence band closer to each other, observe the changes. Some keywords for SAS: confband = ep hw all bandmax bandmin conftype For people using R, you can do the same thing with a package called km.ci Load the package and read the help file. (1) Suppose BM(t) is a Brownian Motion process. Define the process BB(t) = BM(t) - t BM(1), for 0 <= t <= 1 Compute the variance of BB(t), and compute the covariance between BB(t) and BB(s) for 0 <= t < s <= 1. Verify that BB(t) do not have independent increment. ============================================================= (2). Suppose X_1, X_2, ... X_n are iid r.v.s with a cont. CDF F(t). Let U_n(t) = \sqrt n [ \hat F_n(t) - F(t) ] Compute the covariance of U(t) and U(s), i.e. E U(t) U(s) = ? (where \hat F_n(t) is the empirical distribution based on X_i's). (3). If N(t) is a Poisson process, show that N(3t) is also a Poisson process. (4) Suppose N_i(t), i=1,2, ... , n are iid Poisson ( lambda ) processes. 0 < t < 8 . (or some other fixed finite number). Show that: \sum N_i(t) is also a Poisson process with parameter (n lambda) So \sum N_i(t) and N_1( n t) are both Poisson (n lambda) process. ===================================================================== Feb. (5) Let M(t) = N (t) - t where N(t) is a standard Poisson process and also let us define the filtration F_t = history sigma algebra for info in [0, t] We verified in class that M(t) is an F_t martingale. Please verify [M(t)]^2 - t is also an F_t martingale. (hint, the variance of Poisson r.v. is also lambda) March. Use the technique we learned in class to derive the asymptotic variance of \sqrt n [ \int_0^\infty g(t) d \hat F(t) - \int_0^\infty g(t) dF(t)] assume the said variance is finite. Here g(t) is a continuous differentiable function. Hint: first use integration by parts to turn the above into \sqrt n [ \int_0^\infty 1-\hat F(t) dg(t) - \int_0^\infty 1-F(t) dg(t)] here \hat F(t) is the Kaplan-Meier estimator. A related but different quantity: find the asymptotic variance of the follow \sqrt n [ \int_0^\infty g(t) (1-\hat F(t-))d \hat H(t) - \int_0^\infty g(t) (1-\hat F(t-)) dH(t) ] here H(t) is the cumulative hazard function, and \hat H(t) is the Nelson-Aalen estimator. April Make use of the asymptotic variance formula we derived in March, of the quantity \sqrt n [ \int_0^\infty 1-\hat F(t) dg(t) - \int_0^\infty 1-F(t) dg(t)], derive an estimator of the asymptotic variance. Now construct a test based on the mean of g(): standardized to have variance 1 simulate the distribution (histogram) of the normalized test statistics and confirm it is approx. N(0,1) for samples of size 60 and (around) 15% censoring. use the following g() functions: g(t) = exp( -0.1 t^2 ) g(t) = t for 0<= t <= 3 = 3 for t > 3 You may want to take a look of my preliminary R code www.ms.uky.edu/~mai/sta709/varG.R for estimating the variance. Remark: if you do not get exactly 15% censoring, do not worry. anything between 10% and 25% is OK. I realize there are more problems with the coding of this ..... please take a look of the code in the above mentioned place. I updated the code. Last homework: A two sample test, based on the Kaplan-Meier mean: In the previous problem we worked on the one sample. If there are two samples, then a similar "t-test" can also be worked out. Testing whether the two (censored) samples have the same mean. Variance would be the sum of the two (by independence). i.e. numerator: \int g(t) d \hat F_1(t) - \int g(t) d\hat F_2(t) denominator: \sqrt ( Var est. 1 + Var est. 2 ). Compare this with the (2-sample) log-rank test. Interval censor: some code (Reference: The Statistical Analysis of Interval-Censored Failure Time Data, published by Springer, 2006 author: J. Sun) page. 77 simuIC <- function(n) { x <- 1:n/n ## x is the observation times, xorder <- order(x) ## ordered x <- x[xorder] ## z <- rbinom(n, size=1, p=0.5) ## z is the indicator of the sample1 or 2 z <- z[xorder] ## z also ordered according to x y1 <- rexp(n) ## y2 <- rexp(n) ## y1 and y2 are two samples of lifetimes y <- y1*z + y2*(1-z) ## put the two samples together yy <- as.numeric( y < x ) ## interval censoring, current status temp <- isoreg(x=x, y=yy) ## get the estimator hat F, as yf in temp weights <- runif(n) ## create some random weights SDF <- sum(z*(yy - temp$yf)*weights) ## this is the numerator of the test varSDF <- var(z*weights)*sum((yy - temp$yf)^2) ## var est of the numerator ZZ <- SDF/sqrt(varSDF) ## test statistic list( x=x, y=yy, f= temp$yf, ZZ=ZZ, diff=SDF, VD=varSDF) ## output list } Takehome final problem: A quantity important in some analysis is called AUC (Area Under Curve, for ROC curves). The definition of the AUC goes like this: There are two independent samples of survival times, with sample size n and m; and with population survival distributions F_1 and F_2 respectively. AUC = \int F_1(t) d F_2(t) . Based on the right censored data/sample, we can get two Kaplan-Meier estimators for the two samples. An estimator of AUC is obviously to replace the unknown F by the hat, (For censored data, \hat F should be the Kaplan-Meier estimator) \hat AUC = \int \hat F_1(t) d \hat F_2(t). What is the asymptotic variance of (as in a formula) \sqrt(n+m) [ \hat AUC - AUC ] ? assume both n and m increase to infinite, and the ratio n/m remains not close to 0 and not close to 1. Assume the variance is finite. Hint: (1) show that \hat AUC - AUC = \int (\hat F_1 - F_1) dF_2 + \int F_1 d(\hat F_2 - F_2) + \epsilon (argue, at least heuristically \epsilon is the high order small term, involve at least two differences, or difference square...) (2) Assume \sqrt(n+m) \epsilon is negligible as sample sizes grows, we focus on the two integrals. Variance of first integral can be handled similar to homework. Integration by parts on the second integral will bring it to the familiar form. (assume you always define the last obs. as uncensored) Anyone interested on some further reading, please see the comments at the bottom of this file. Old homework ============================================== (1). Suppose N(t) is a Poisson process. Verify the following is a (continuous time) martingale. M(t) = N(t) - lambda t for t in [0, T] Be sure to define your filtration (increasing family of sigma algebras) F_t. (The history) ==================================================== (2). Suppose N_i(t), i=1,2, ... , n are iid Poisson ( lambda ) processes. 0 < t < 8 . (or some other fixed finite number). Show that M_n(t) = \sum [ N_i(t) - lambda t ] / sqrt (n lambda ) as n \to \infty has the following property (a) for fixed t the distribution of M_n(t) converges to a Normal distribution with 0 mean Identify the variance. (b) the sample path of M_n(t) has jumps of size converge to 0. (1/sqrt n \lambda) Note: \sum N_i(t) is also a Poisson process with parameter (n lambda) ======================================================= (3). If N(t) is a Poisson process, show that N(3t) is also a Poisson process. =============================================================== (4). Suppose X_1, X_2, ... X_n are iid r.v.s with a cont. CDF F(t). Let U_n(t) = \sqrt n [ \hat F_n(t) - F(t) ] Compute the covariance of U(t) and U(s), i.e. E U(t) U(s) = ? (where \hat F_n(t) is the empirical distribution based on X_i's). ================================================================ (5). Suppose M(t) is the Poisson martingale we defined in the first problem. Show that M^2(t) - \lambda t is again a (cont. time) F_t martingale. (same filtration we used there). (5.5) Suppose M(t) is a martingale. Show that E[ M^2 (t + dt) - M^2 (t) |F_t] = E[ ( M(t+dt) - M(t) )^2 |F_t] ==================================================================== (6) Let N(t)= I[ X <= t ], where X is a positive, cont. rv. Define the filtration F_t = \sigma { N(s); 0 <= s <= t }. (a) show that h(t)I[X >= t] is predictable. where h(t) is the hazard function of X. (b) verify the conditional expectation of the increment of N(t) over (t-dt, t] , given F_{t-dt} (i.e. either given I[X >= t]=0 or given I[X >= t]=1) is h(t)I[X >= t] dt Note this shows that N(t) - \int_0^t h(s)I[X >= s] ds is an F_t martingale. ================================================================== (7) We proved the martingale representation of the Kaplan-Meier estimator for continuous CDF F(). Now Suppose that the CDF F() hava a jump at t, verify the martingale representation of the Kaplan-Meier estimator is still valid at t. For simplicity, assume all the observed death times do not equal to t. Also, when the CDF F() is not continuous, the compensator of the counting preocess needs to be written as \int_0^t I_{[X >= s]} d H(s) ; instead of \int_0^t I_{[X >= s]} h(s) ds . ================================================================= (8) Consider a stochastic jump process N(t), t>=0 that fit the following description: (a) N(0) = 0 (b) it can have at most one jump. (c) N(t) have probability zero to have jump for 0 <= t <= a (d) after the point a, the waiting time to the jump is Weibul( 1, 0.2) distribution; (e) the size of the jump is inversely proportional to the waiting time after a. (f) if after time c, N(t) still have no jump, then it will never jump. Assume 0= 0 (sample path are piecewise constant) that fit the following description: (a) N(0) == 0 (b) it can have at most two jumps. (c) N(t) have probability zero to jump for 0 <= t <= a (d) after the point a, the waiting time to the first jump is Weibul(2, 0.2) distribution; The waiting time to the second jump (after the first jump occured) is exp( \lambda= length of first waiting time). (e) after time c, N(t) will never jump. Assume 0Y] as a kernal of a U-statistics. And make the ELT theorem hold for all U-statistics, (well, all those with order 2 and finite variance) for example H(x,y) = xy. Or is this trivially true given Owen's outline proof? What about 3 samples? For censored data, we can try I[X>Y] too. What about U-stat?