Some of the project topics are too small to be a 2-person project.
If you want to work on a specific one, please let me know.

=================================
Possible projects for Sta 695:
=================================

Added April 2009.

(1) Adopt the R package kmci and bring it up to the current version of R.
    This is a package for obtaining various confidence intervals related to
    the Kaplan-Meier estimator. It used to be available on CRAN, but it needs
    to be brought up to the new version of R. The older version is still
    available on the web.

(2) Compare two versions of the Wilcoxon test: the version from the classic
    nonparametric text, and the one obtained by specializing the
    Gehan-Wilcoxon test to the case of no censoring.
    The difference is in the variance estimation.
    Topics to investigate: Which variance estimator is better (more accurate)?
    Under H0, or under Ha? Which gives a better normal approximation
    for the distribution of the test statistic? Which one gives better power?

Added Feb. 2008.

(1) Use empirical likelihood to test hypotheses about the mean residual lifetime.
    [see my notes: empirical likelihood and mean residual time]
    Write an R function to test whether the mean/median residual time at a
    given age is equal to a given number of years, based on the code
    el.cen.EM2(), which will search for the min over a value automatically.
    Equality of two mean/median residual lifetimes.

Added 2005.

(0) A more efficient algorithm (than grid search) to find the confidence
    region (for dim >= 2) from the likelihood ratio value.

(0.5) Sample size determination for the logrank test. The influence of
    censoring, etc. Survey of software. Demonstration of a free package.
    See http://www.biostat.wisc.edu/~kosorok/renyi.html
    Ref: Sample Size Calculations in Clinical Research
    by Shein-Chung Chow, Jun Shao and Hansheng Wang
    http://www.childrens-mercy.org/stats/weblog2004/survival.asp
    http://www.jhsph.edu/Research/Centers/CCT/javamarc/Shih/shihsizeuserguide.htm

(1) A special type of regression model:
        y_i = a + b x_i + U_i e_i
    where U_i = exp(r x_i) and e_i has an extreme value distribution.
    How to estimate a, b, r parametrically? (MLE?)
    Is there a two-step least squares procedure?
    What about censored data?

(2) Bootstrap applications in survival analysis. There are many...
    (distribution of the logrank test for small sample sizes, etc.)

(3) Piecewise exponential: all the related stuff, carried to the limit...
    including some R coding...
    * Interval-censored data MLE, and how does the estimator change as the
      number of pieces grows?
    * How to do a proc lifereg with the error distribution as piecewise
      exponential? How do things change as the number of pieces grows? etc.
      With fixed cutting points (not changing with subjects)?

(4) Stability/robustness of the Kaplan-Meier/Nelson-Aalen estimator:
    under observation error/perturbation (values observed with error);
    under misclassified censoring indicators, etc.

(5) Implement the Lin and Wei (1992) paper (of 3 pages) on the Buckley-James
    estimator and compare it with existing (EL) methods.

(6) Cox model with a surviving fraction (cure model): model assumptions and
    implementation. Compare with the regular Cox model.

=====================================================================

(0). Estimation (Kaplan-Meier, Nelson-Aalen) with late entry/early
    withdrawal data. The variance estimator.
    Confidence intervals: comparison of several methods.
    Take a look at the code for the Cox model in R/SAS.

(1). One-sample log-rank and other rank tests.
    Compare one sample to the general population (census) data.
    (From the survival package ratetables.)
    Comparison of several methods (accuracy of p-value).
    May also include the covariates race, age and sex in the test
    (adjust for covariates).
    The survival package in R (Splus) and its documentation should be helpful.

(2). How to choose the weight function in the weighted log-rank type tests
    to maximize power? (more theoretical)
    Reference: Gill's book (censoring and stochastic integrals)

(3). Similar to (2), but to demonstrate that the power of the log-rank test
    can suffer under a non-proportional hazards alternative $H_A$.
    And the possible fix (apply the test only on a certain time interval?
    Ad hoc, or more systematically?).
    %How to handle late entry/early withdrawal data?

(3.5) (added 2006) Combine two tests: a logrank test and a test for crossing
    hazards. Evaluate the (power) properties.

    Update for 3 and 3.5: since I posted my talk on this topic, this becomes
    a one-person project, and I expect more examples.

(4). Testing hypotheses for equality of, and confidence intervals for the
    difference of, two medians
    (use R function discemlik() or emplik.Hs.test()).
    See also (7); should work in close tie with (7).
    Better or worse than the log-rank test?
    (If you want to work on this, I have some more info.)

(5). Residuals in the parametric regression model (lifereg) and the
    semi-parametric regression model (phreg). Types of residuals.
    How do they behave under correct and wrong models
    (mis-specification, omission of a covariate...)? Simulations. Plots.

(6). Frailty model. An introduction and example. Use the exponential
    regression model with a random effects term to explain.
    Some useful references: book by Therneau and Grambsch;
    Tech Report by Therneau, Grambsch and Pankratz.

(7) Confidence interval estimation of the median with censored data via
    empirical likelihood, el.cen.EM().
    Variations: use a smoothed indicator function, either a linear smoother
    or a cubic smoother.
    Cubic:  G(t) = (3/(4 \sqrt 5)) (t - t^3/15 + 2 \sqrt 5/3),
            for |t| < \sqrt 5, and zero or one otherwise.
    Linear: G(t) = t + 0.5, for |t| < 0.5, and zero or one otherwise.
    Compare to the performance with the plain indicator function.
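
To make the two smoothers in (7) concrete, here is a minimal sketch in Python (not the R code inside el.cen.EM()). The cubic piece carries the factor 3/(4 \sqrt 5), which makes it rise from 0 at -\sqrt 5 to 1 at \sqrt 5. The function names and the bandwidth h are hypothetical, introduced only for illustration.

```python
import math

SQRT5 = math.sqrt(5.0)


def linear_smoother(t):
    """Linear smoother: 0 for t <= -0.5, t + 0.5 in between, 1 for t >= 0.5."""
    if t <= -0.5:
        return 0.0
    if t >= 0.5:
        return 1.0
    return t + 0.5


def cubic_smoother(t):
    """Cubic smoother on (-sqrt 5, sqrt 5), clipped to 0 and 1 outside.

    Equal to (3 / (4 sqrt 5)) * (t - t^3/15 + 2 sqrt 5 / 3), written here
    as 1/2 + (3 / (4 sqrt 5)) * (t - t^3/15).
    """
    if t <= -SQRT5:
        return 0.0
    if t >= SQRT5:
        return 1.0
    return 0.5 + 3.0 / (4.0 * SQRT5) * (t - t ** 3 / 15.0)


def smoothed_indicator(x, theta, h, smoother):
    """Smooth stand-in for the indicator I(x <= theta).

    h > 0 is a bandwidth (an assumption of this sketch): the smaller h is,
    the closer the result is to the plain 0/1 indicator.
    """
    return smoother((theta - x) / h)


if __name__ == "__main__":
    # Compare the plain indicator with both smoothed versions near theta = 1.
    for x in (0.6, 0.9, 1.0, 1.1, 1.4):
        plain = 1.0 if x <= 1.0 else 0.0
        lin = smoothed_indicator(x, 1.0, 0.5, linear_smoother)
        cub = smoothed_indicator(x, 1.0, 0.5, cubic_smoother)
        print(x, plain, round(lin, 4), round(cub, 4))
```

Both smoothers equal 1/2 at t = 0, so the smoothed indicator agrees with the plain one away from theta and differs only within a window around theta whose width is controlled by h.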

(8) Efficiency comparison between the Cox model and the exponential/Weibull
    model when all models are valid (and under small departures).
    Under various model specifications: beta value, censoring percentage, etc.
    (see my Notes).
    (I will do this in class, so you cannot do it again :-( ).

(9) Compare the performance of two versions of the logrank test:
    one from proc lifetest, another from the proc phreg score test.
    Try them on continuous and discrete distributions. (small project)
    Which has better power for small samples? Or more accurate P values?

Data sets online:
http://www.sci.usq.edu.au/staff/dunn/Datasets/tech-anova.html
http://archive.ics.uci.edu/ml/

Also a project topic:

Make the confint2( ) function work for the Cox model coxph() and for the
survreg() function, similar to how it works for the glm() function.
At present, the only case in which it returns a Wilks likelihood ratio
interval is glm(); in other cases it returns a Wald interval.

Cross validation of the Buckley-James estimator: given a data set,
Buckley-James not only gives an estimator of beta, it also gives an estimator
of the error distribution CDF (by Kaplan-Meier). When we do cross validation,
are we just validating the beta, or validating both? What if the sample sizes
of the training sample and the validation sample are very different?
(I think cross validation should validate both beta and CDF.)
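
For the confint2( ) topic above, the difference between a Wald interval and a Wilks likelihood ratio interval can be seen in the simplest censored-data model. The sketch below is in Python and is not the confint2( ) code. It assumes a one-parameter exponential model with right censoring, where d is the number of observed events and total_time is the total follow-up time, so the log likelihood is d log(lam) - lam * total_time. All function names are hypothetical.

```python
import math

CHI2_1_95 = 3.841458820694124   # 0.95 quantile of chi-square, 1 df
Z_975 = 1.959963984540054       # 0.975 quantile of the standard normal


def loglik(lam, d, total_time):
    """Exponential log likelihood with right censoring."""
    return d * math.log(lam) - lam * total_time


def lr_stat(lam, d, total_time):
    """-2 log likelihood ratio for H0: rate = lam."""
    lam_hat = d / total_time
    return 2.0 * (loglik(lam_hat, d, total_time) - loglik(lam, d, total_time))


def wald_interval(d, total_time):
    """Symmetric interval: MLE +/- z * standard error, se = lam_hat / sqrt(d)."""
    lam_hat = d / total_time
    se = lam_hat / math.sqrt(d)
    return lam_hat - Z_975 * se, lam_hat + Z_975 * se


def bisect(f, lo, hi, n_iter=200):
    """Root of f on [lo, hi]; assumes f(lo) and f(hi) have opposite signs."""
    f_lo = f(lo)
    for _ in range(n_iter):
        mid = 0.5 * (lo + hi)
        f_mid = f(mid)
        if (f_mid > 0) == (f_lo > 0):
            lo, f_lo = mid, f_mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def wilks_interval(d, total_time):
    """All rates whose likelihood ratio statistic is below the chi-square cutoff."""
    lam_hat = d / total_time

    def g(lam):
        return lr_stat(lam, d, total_time) - CHI2_1_95

    lower = bisect(g, lam_hat * 1e-8, lam_hat)
    hi = 2.0 * lam_hat
    while g(hi) < 0:            # widen the bracket until it crosses the cutoff
        hi *= 2.0
    upper = bisect(g, lam_hat, hi)
    return lower, upper


if __name__ == "__main__":
    print("Wald :", wald_interval(10, 100.0))
    print("Wilks:", wilks_interval(10, 100.0))
```

The Wald interval is symmetric about the MLE, while the Wilks interval follows the shape of the likelihood and is asymmetric for small d. For coxph() or survreg() the same inversion would use the profile (partial) likelihood in place of loglik.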