Thursday, September 6, 2012

Study Notes 3: Predictive Modeling with Logistic Regression: Grouping Redundant Variables

/* Redundant variables can degrade the analysis by destabilizing the parameter estimates,            */
/* increasing the risk of overfitting, confounding interpretation, and adding computing time.        */
/* One remedy for redundant variables is to group the related variables into clusters.               */
/* PROC VARCLUS finds groups of variables that are as correlated as possible within a group and as   */
/* uncorrelated as possible with variables in other groups. It runs a principal components analysis  */
/* on the variables in each cluster and checks the second eigenvalue. If that eigenvalue is greater  */
/* than a given threshold (MAXEIGEN=), the cluster is split into two child clusters. Splitting       */
/* repeats until the second eigenvalue of every cluster is at or below the threshold.                */

/* After the original variables are split into clusters, the cluster components (scores) can         */
/* replace the original variables as predictors in the regression model, or a representative         */
/* variable can be selected from each cluster: one that has a high correlation with its own cluster  */
/* and a low correlation with the other clusters. The representative is picked by the 1-R^2 ratio,   */
/* choosing the variable with the smallest value in each cluster:                                    */
/* 1-R^2 ratio = (1 - R^2 with own cluster) / (1 - R^2 with next closest cluster)                    */
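
/* A minimal worked example of the ratio above. The two R^2 values are hypothetical numbers chosen   */
/* for illustration; they are not taken from the PVA results.                                        */

```sas
data _null_;
   rsq_own  = 0.90;   /* hypothetical R^2 with the variable's own cluster      */
   rsq_next = 0.30;   /* hypothetical R^2 with the next closest cluster        */
   ratio = (1 - rsq_own) / (1 - rsq_next);   /* 0.10 / 0.70                    */
   put ratio= 6.4;    /* writes ratio=0.1429 to the log                        */
run;
```

/* A ratio near 0 means the variable is tightly tied to its own cluster and only weakly tied to the  */
/* nearest other cluster, which is what a good representative looks like.                            */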


%let ex_inputs= MONTHS_SINCE_ORIGIN DONOR_AGE IN_HOUSE INCOME_GROUP PUBLISHED_PHONE
MOR_HIT_RATE WEALTH_RATING MEDIAN_HOME_VALUE MEDIAN_HOUSEHOLD_INCOME PCT_OWNER_OCCUPIED
PER_CAPITA_INCOME PCT_MALE_MILITARY PCT_MALE_VETERANS PCT_VIETNAM_VETERANS
PCT_WWII_VETERANS PEP_STAR RECENT_STAR_STATUS FREQUENCY_STATUS_97NK RECENT_RESPONSE_PROP
RECENT_AVG_GIFT_AMT RECENT_CARD_RESPONSE_PROP RECENT_AVG_CARD_GIFT_AMT RECENT_RESPONSE_COUNT
RECENT_CARD_RESPONSE_COUNT LIFETIME_CARD_PROM LIFETIME_PROM LIFETIME_GIFT_AMOUNT
LIFETIME_GIFT_COUNT LIFETIME_AVG_GIFT_AMT LIFETIME_GIFT_RANGE LIFETIME_MAX_GIFT_AMT
LIFETIME_MIN_GIFT_AMT LAST_GIFT_AMT CARD_PROM_12 NUMBER_PROM_12 MONTHS_SINCE_LAST_GIFT
MONTHS_SINCE_FIRST_GIFT    ;

proc varclus data=pva1 short hi maxeigen=.7 plots=dendrogram;
      var &ex_inputs mi_donor_age mi_income_group mi_wealth_rating;
run;
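
/* To work with the 1-R^2 ratios as data rather than reading them off the listing, the same run can  */
/* be captured with ODS OUTPUT. This is a sketch, not a tested program: the ODS table names           */
/* (ClusterQuality, RSquare), the columns NumberOfClusters and RSquareRatio, and the data set names   */
/* summary and clusters are assumptions and should be checked with ODS TRACE ON before relying on     */
/* them.                                                                                             */

```sas
ods select none;                        /* suppress the printed output for this run */
ods output clusterquality=summary       /* assumed table: cluster quality summary   */
           rsquare=clusters;            /* assumed table: R^2 for every variable    */

proc varclus data=pva1 hi maxeigen=.7;
   var &ex_inputs mi_donor_age mi_income_group mi_wealth_rating;
run;

ods select all;                         /* restore normal output                    */

/* Store the final number of clusters in a macro variable */
data _null_;
   set summary;
   call symputx('ncl', NumberOfClusters);
run;

/* List only the final solution; a low RSquareRatio marks a candidate representative */
proc print data=clusters noobs;
   where NumberOfClusters = &ncl;
   var Cluster Variable RSquareRatio;
run;
```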


/* The result has 23 clusters (splitting stops once every second eigenvalue is <= .7).               */
/* Cluster 1 contains 5 variables, so its total variation is 5. The variation explained by the       */
/* cluster component is 4.277752, which is about 85.56% of that total.                               */
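
/* The proportion quoted above is the explained variation divided by the number of variables in the  */
/* cluster. A small check of that arithmetic (the variable names here are illustrative):             */

```sas
data _null_;
   n_vars     = 5;                     /* variables in cluster 1 = total variation */
   explained  = 4.277752;              /* variation explained by the cluster       */
   proportion = explained / n_vars;    /* 0.8555504, about 85.56%                  */
   put proportion= percent8.2;
run;
```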




/* The 23 cluster components can be used as predictors directly, or a representative variable can be */
/* picked from each cluster based on the 1-R^2 ratio, e.g., months_since_first_gift represents       */
/* cluster 1.                                                                                        */
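
/* For the first option, one possible way to turn the clusters into predictors is to save the        */
/* scoring coefficients with OUTSTAT= and apply them with PROC SCORE. This is a sketch under         */
/* assumptions: the data set names vc_stats, vc_final and pva_scored are made up for illustration,   */
/* and the filter on _NCL_ assumes the final solution is the 23-cluster one reported above.          */

```sas
/* Save the statistics and scoring coefficients for every solution */
proc varclus data=pva1 hi maxeigen=.7 outstat=vc_stats noprint;
   var &ex_inputs mi_donor_age mi_income_group mi_wealth_rating;
run;

/* Keep the overall statistics (_NCL_ missing) and the 23-cluster solution only */
data vc_final;
   set vc_stats;
   if _ncl_ = . or _ncl_ = 23;
   drop _ncl_;
run;

/* Compute one cluster component score per cluster for each observation */
proc score data=pva1 score=vc_final out=pva_scored;
   var &ex_inputs mi_donor_age mi_income_group mi_wealth_rating;
run;
```

/* The scored data set then carries one new column per cluster, which can be listed as inputs in the */
/* logistic regression in place of the original variables.                                           */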





