Final (one of two)    Due Dec. 16, 1 PM

(1) Use the data-generating code in simu3( ), but use LASSO to estimate the beta's, and compare with PGA/OGA in terms of RPE. Note that we can use cross-validation to pick the best model when using LASSO (it is available in the package glmnet, or penalized). Please set a seed so that other people can repeat your results. Please choose a var value in simu3( ), e.g. simu3( var=2.5 ), such that the question is neither too easy (the method gets it right every time) nor too difficult (the estimation never gets it right). Please compare the two methods over many trial runs, e.g. N=100, and report the average RPE of each method. You may also compare/report the correct-selection counts, but the average RPE seems more important.

Remark: LASSO is available in the package glmnet. Cross-validation is also available inside the package glmnet.

Remark: If you want, you may play with the penalty function: this is the input alpha in the function glmnet( ). LASSO corresponds to alpha = 1, but you can also try alpha < 1 to see what happens.

Remark: I am expecting a summary report of your findings, with all the R code as an appendix.

====================================================================================

Simulation Example of High-Dimensional Regression

** A regression model with more predictors than observations (p > n).
** [Sparsity] However, most predictors are irrelevant (with beta = 0).
** Two methods discussed: LASSO and its variations, and forward stepwise selection (boosting).

Here is the simulation example for STA705:

simu3 <- function( var = 2.25 ) {
  q <- 3
  n <- 200
  p <- 1000
  beta <- c(c(4.2, 3.2, 3.2), rep(0, p - q))
  sigma_square <- var
  eta <- 1
  epsilon <- rnorm(n, 0, sqrt(sigma_square))
  z <- matrix(rnorm(n * p, 1, 1), n, p)
  w <- rnorm(n, 0, 1)
  x <- z + eta * w                           ### the design matrix, with some corr
  y0 <- x[, 1:q] %*% beta[1:q] + epsilon     ### i.e. a = 0, save the 0 calc

  pga_obj <- PGA(x, y0, 1000)
  oga_obj <- OGA(x, y0, pga_obj$coeff_final)
  M <- which.min(oga_obj$HDBIC)
  beta_fit <- as.vector(oga_obj$coeff_all[, M])

  n1  <- sum(beta_fit[2:4] != 0)
  n2  <- sum(beta_fit[5:1001] != 0)
  ab2 <- sum(abs(beta_fit[5:1001]))

  ## ADD LASSO estimation here

  q <- 3
  n <- 200
  p <- 1000
  eta <- 1
  sigma_square <- var
  epsilon <- rnorm(n, 0, sqrt(sigma_square))
  z <- matrix(rnorm(n * p, 1, 1), n, p)
  w <- rnorm(n, 0, 1)
  x <- z + eta * w
  y00 <- x[, 1:q] %*% beta[1:q] + epsilon
  y00 <- as.vector(y00)
  beta_fit <- as.matrix(beta_fit)
  RPE <- sum((y00 - cbind(rep(1, n), x) %*% beta_fit)^2)
  RPE <- RPE / (n * sigma_square)

  return(c(n1, n2, ab2, RPE))
}
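As a minimal sketch of the "ADD LASSO estimation here" step (not the official solution), one could write a LASSO counterpart of simu3( ) that reuses the same data-generating code and fits the model with cv.glmnet( ) from the glmnet package, picking lambda by cross-validation. The function name simu3_lasso, the seed value, and all intermediate variable names below are illustrative assumptions; the comparison driver at the end assumes PGA( ) and OGA( ) from the course code are loaded so that simu3( ) itself runs.

library(glmnet)

simu3_lasso <- function(var = 2.5, alpha = 1) {
  q <- 3; n <- 200; p <- 1000; eta <- 1
  beta <- c(4.2, 3.2, 3.2, rep(0, p - q))
  sigma_square <- var

  ## training data, generated exactly as in simu3()
  epsilon <- rnorm(n, 0, sqrt(sigma_square))
  z <- matrix(rnorm(n * p, 1, 1), n, p)
  w <- rnorm(n, 0, 1)
  x <- z + eta * w
  y0 <- as.vector(x[, 1:q] %*% beta[1:q] + epsilon)

  ## LASSO (alpha = 1) with 10-fold cross-validation to choose lambda
  cv_fit <- cv.glmnet(x, y0, alpha = alpha)
  beta_fit <- as.numeric(coef(cv_fit, s = "lambda.min"))  # intercept first, length p + 1

  n1  <- sum(beta_fit[2:4] != 0)        # true predictors selected
  n2  <- sum(beta_fit[5:(p + 1)] != 0)  # false positives
  ab2 <- sum(abs(beta_fit[5:(p + 1)]))

  ## independent test data, as in the second block of simu3()
  epsilon <- rnorm(n, 0, sqrt(sigma_square))
  z <- matrix(rnorm(n * p, 1, 1), n, p)
  w <- rnorm(n, 0, 1)
  x <- z + eta * w
  y00 <- as.vector(x[, 1:q] %*% beta[1:q] + epsilon)

  RPE <- sum((y00 - cbind(rep(1, n), x) %*% beta_fit)^2) / (n * sigma_square)
  c(n1, n2, ab2, RPE)
}

## Illustrative comparison over N trial runs (N = 100 in the assignment);
## the seed value is arbitrary but should be reported so results are repeatable.
set.seed(705)
N <- 100
res_lasso  <- t(replicate(N, simu3_lasso(var = 2.5)))
res_pgaoga <- t(replicate(N, simu3(var = 2.5)))   # requires PGA()/OGA() from the course code
c(mean_RPE_lasso = mean(res_lasso[, 4]), mean_RPE_pgaoga = mean(res_pgaoga[, 4]))

Passing alpha < 1 to the same simu3_lasso( ) call gives the elastic-net variant mentioned in the second remark, since cv.glmnet( ) forwards alpha to glmnet( ); swapping s = "lambda.min" for s = "lambda.1se" is another common cross-validation choice.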