2011:
For the Empirical Likelihood ratio (Wilks) theorem for hazard integrals with
censored data: show the two modifications:
(a) Verify that the null hypothesis \int g(t) d H(t) = \theta_0, can be replaced by
\int g(t) d H(t) = \theta_0 + o(1/sqrt n) and the chi square
limit for the -2 likelihood ratio still hold under H_0.
(b) If we replace the null hypothesis by \int g(t) dH(t) = \theta_0 + ( C / \sqrt n)
then the limit of the -2 log likelihood ratio has limit of a non central
chi square distribution. Identify this non-centrality parameter.
You may want to consult the note 2.1. TO make things simpler, you may assume just
one constraint. (So the resulting chi square has df=1)
==============================================================================
Is there a simple proof for the NPMLE of Kaplan-Meier? like Owen 2001's
proof of the sampling distribution being NPMLE?
Added 2010 Spring:
Exercise: Without extra constraint, the maximization of the nonparametric likelihood
function with right censored data, is solved by Kaplan-Meier (1958). Try to solve
it yourself, without consult the paper.
There is an explicit solution given by Kaplan-Meier,
there is also a recursive solution.
For 2007 Spring
Jan.
(0) Use a data set with more than 100 observations.
Use SAS proc lifetest (version 9)
obtain the confidence band for the Kaplan-Meier
curve and point-wise confidence interval. Compare.
Try to move the upper and lower time boundary for the
confidence band closer to each other, observe the changes.
Some keywords for SAS: confband = ep hw all
bandmax
bandmin
conftype
For people using R, you can do the same thing with
a package called km.ci Load the package and read the help
file.
(1) Suppose BM(t) is a Brownian Motion process.
Define the process
BB(t) = BM(t) - t BM(1), for 0 <= t <= 1
Compute the variance of BB(t), and compute
the covariance between BB(t) and BB(s) for
0 <= t < s <= 1.
Verify that BB(t) do not have independent increment.
=============================================================
(2). Suppose X_1, X_2, ... X_n are iid r.v.s with a cont. CDF F(t).
Let U_n(t) = \sqrt n [ \hat F_n(t) - F(t) ]
Compute the covariance of U(t) and U(s), i.e. E U(t) U(s) = ?
(where \hat F_n(t) is the empirical distribution based on X_i's).
(3). If N(t) is a Poisson process, show that N(3t) is also a Poisson
process.
(4) Suppose N_i(t), i=1,2, ... , n are iid Poisson ( lambda )
processes. 0 < t < 8 . (or some other fixed finite number).
Show that: \sum N_i(t) is also a Poisson process with parameter (n lambda)
So \sum N_i(t) and N_1( n t) are both Poisson (n lambda) process.
=====================================================================
Feb.
(5) Let M(t) = N (t) - t where N(t) is a standard Poisson process
and also let us define the filtration
F_t = history sigma algebra for info in [0, t]
We verified in class that M(t) is an F_t martingale.
Please verify [M(t)]^2 - t is also an F_t martingale.
(hint, the variance of Poisson r.v. is also lambda)
March.
Use the technique we learned in class to derive
the asymptotic variance of
\sqrt n [ \int_0^\infty g(t) d \hat F(t) - \int_0^\infty g(t) dF(t)]
assume the said variance is finite. Here g(t) is a continuous differentiable
function.
Hint: first use integration by parts to turn the above into
\sqrt n [ \int_0^\infty 1-\hat F(t) dg(t) - \int_0^\infty 1-F(t) dg(t)]
here \hat F(t) is the Kaplan-Meier estimator.
A related but different quantity: find the asymptotic
variance of the follow
\sqrt n [ \int_0^\infty g(t) (1-\hat F(t-))d \hat H(t) -
\int_0^\infty g(t) (1-\hat F(t-)) dH(t) ]
here H(t) is the cumulative hazard function, and \hat H(t) is the
Nelson-Aalen estimator.
April
Make use of the asymptotic variance formula we derived in March, of the quantity
\sqrt n [ \int_0^\infty 1-\hat F(t) dg(t) - \int_0^\infty 1-F(t) dg(t)],
derive an estimator of the asymptotic variance.
Now construct a test based on the mean of g(): standardized to
have variance 1
simulate the distribution (histogram) of the normalized
test statistics and confirm it is approx. N(0,1)
for samples of size 60 and (around) 15% censoring. use the following
g() functions: g(t) = exp( -0.1 t^2 )
g(t) = t for 0<= t <= 3
= 3 for t > 3
You may want to take a look of my preliminary R code
www.ms.uky.edu/~mai/sta709/varG.R
for estimating the variance.
Remark: if you do not get exactly 15% censoring, do not worry.
anything between 10% and 25% is OK.
I realize there are more problems with the coding of this ..... please take a
look of the code in the above mentioned place. I updated the code.
Last homework: A two sample test, based on the Kaplan-Meier mean:
In the previous problem we worked on the one sample.
If there are two samples, then a similar "t-test" can also be worked out.
Testing whether the two (censored) samples have the same mean.
Variance would be the sum of the two (by independence).
i.e.
numerator: \int g(t) d \hat F_1(t) - \int g(t) d\hat F_2(t)
denominator: \sqrt ( Var est. 1 + Var est. 2 ).
Compare this with the (2-sample) log-rank test.
Interval censor: some code
(Reference: The Statistical Analysis of Interval-Censored
Failure Time Data, published by Springer, 2006 author: J. Sun)
page. 77
simuIC <- function(n) {
x <- 1:n/n ## x is the observation times,
xorder <- order(x) ## ordered
x <- x[xorder] ##
z <- rbinom(n, size=1, p=0.5) ## z is the indicator of the sample1 or 2
z <- z[xorder] ## z also ordered according to x
y1 <- rexp(n) ##
y2 <- rexp(n) ## y1 and y2 are two samples of lifetimes
y <- y1*z + y2*(1-z) ## put the two samples together
yy <- as.numeric( y < x ) ## interval censoring, current status
temp <- isoreg(x=x, y=yy) ## get the estimator hat F, as yf in temp
weights <- runif(n) ## create some random weights
SDF <- sum(z*(yy - temp$yf)*weights) ## this is the numerator of the test
varSDF <- var(z*weights)*sum((yy - temp$yf)^2) ## var est of the numerator
ZZ <- SDF/sqrt(varSDF) ## test statistic
list( x=x, y=yy, f= temp$yf, ZZ=ZZ, diff=SDF, VD=varSDF) ## output list
}
Takehome final problem:
A quantity important in some analysis is called AUC
(Area Under Curve, for ROC curves).
The definition of the AUC goes like this:
There are two independent samples of survival times, with
sample size n and m; and
with population survival distributions F_1 and F_2 respectively.
AUC = \int F_1(t) d F_2(t) .
Based on the right censored data/sample,
we can get two Kaplan-Meier estimators for the two samples.
An estimator of AUC is obviously to replace the unknown F by the hat,
(For censored data, \hat F should be the Kaplan-Meier estimator)
\hat AUC = \int \hat F_1(t) d \hat F_2(t).
What is the asymptotic variance of (as in a formula)
\sqrt(n+m) [ \hat AUC - AUC ] ?
assume both n and m increase to infinite, and the
ratio n/m remains not close to 0 and not close to 1.
Assume the variance is finite.
Hint: (1) show that \hat AUC - AUC =
\int (\hat F_1 - F_1) dF_2 + \int F_1 d(\hat F_2 - F_2) + \epsilon
(argue, at least heuristically \epsilon is the high order
small term, involve at least two differences, or
difference square...)
(2) Assume \sqrt(n+m) \epsilon is negligible as sample sizes grows,
we focus on the two integrals. Variance of first integral can
be handled similar to homework.
Integration by parts on the second integral will bring it to
the familiar form. (assume you always define the last obs. as
uncensored)
Anyone interested on some further reading, please see the comments
at the bottom of this file.
Old homework
==============================================
(1). Suppose N(t) is a Poisson process.
Verify the following is a (continuous time) martingale.
M(t) = N(t) - lambda t
for t in [0, T]
Be sure to define your filtration (increasing family
of sigma algebras) F_t. (The history)
====================================================
(2). Suppose N_i(t), i=1,2, ... , n are iid Poisson ( lambda )
processes. 0 < t < 8 . (or some other fixed finite number).
Show that
M_n(t) = \sum [ N_i(t) - lambda t ] / sqrt (n lambda )
as n \to \infty has the following property
(a) for fixed t the distribution of M_n(t)
converges to a Normal distribution with 0 mean
Identify the variance.
(b) the sample path of M_n(t) has jumps of size converge to 0.
(1/sqrt n \lambda)
Note: \sum N_i(t) is also a Poisson process with parameter (n lambda)
=======================================================
(3). If N(t) is a Poisson process, show that N(3t) is also a Poisson
process.
===============================================================
(4). Suppose X_1, X_2, ... X_n are iid r.v.s with a cont. CDF F(t).
Let U_n(t) = \sqrt n [ \hat F_n(t) - F(t) ]
Compute the covariance of U(t) and U(s), i.e. E U(t) U(s) = ?
(where \hat F_n(t) is the empirical distribution based on X_i's).
================================================================
(5). Suppose M(t) is the Poisson martingale we defined in the first problem.
Show that
M^2(t) - \lambda t
is again a (cont. time) F_t martingale.
(same filtration we used there).
(5.5) Suppose M(t) is a martingale. Show that
E[ M^2 (t + dt) - M^2 (t) |F_t] = E[ ( M(t+dt) - M(t) )^2 |F_t]
====================================================================
(6) Let N(t)= I[ X <= t ], where X is a positive, cont. rv.
Define the filtration F_t = \sigma { N(s); 0 <= s <= t }.
(a) show that h(t)I[X >= t] is predictable.
where h(t) is the hazard function of X.
(b) verify the conditional expectation of the
increment of N(t) over (t-dt, t] , given F_{t-dt}
(i.e. either given I[X >= t]=0 or given I[X >= t]=1)
is h(t)I[X >= t] dt
Note this shows that N(t) - \int_0^t h(s)I[X >= s] ds
is an F_t martingale.
==================================================================
(7) We proved the martingale representation of the Kaplan-Meier estimator
for continuous CDF F(). Now Suppose that the CDF F() hava a jump at t,
verify the
martingale representation of the Kaplan-Meier estimator is still valid at t.
For simplicity, assume all the observed death times do not equal to t.
Also, when the CDF F() is not continuous, the compensator
of the counting preocess needs to be written as
\int_0^t I_{[X >= s]} d H(s) ; instead of
\int_0^t I_{[X >= s]} h(s) ds .
=================================================================
(8)
Consider a stochastic jump process N(t), t>=0 that fit the
following description:
(a) N(0) = 0
(b) it can have at most one jump.
(c) N(t) have probability zero to have jump for 0 <= t <= a
(d) after the point a, the waiting time to the jump is
Weibul( 1, 0.2) distribution;
(e) the size of the jump is inversely proportional to the
waiting time after a.
(f) if after time c, N(t) still have no jump, then it will never jump.
Assume 0= 0 (sample path
are piecewise constant)
that fit the following description:
(a) N(0) == 0
(b) it can have at most two jumps.
(c) N(t) have probability zero to jump for 0 <= t <= a
(d) after the point a, the waiting time to the first jump is
Weibul(2, 0.2) distribution; The waiting time to the second jump
(after the first jump occured) is exp( \lambda= length of first waiting time).
(e) after time c, N(t) will never jump.
Assume 0Y] as a kernal of a U-statistics.
And make the ELT theorem hold for all U-statistics,
(well, all those with order 2 and finite variance)
for example H(x,y) = xy. Or is this trivially true
given Owen's outline proof?
What about 3 samples?
For censored data, we can try I[X>Y] too. What about U-stat?