/* Redundant variables can degrade the analysis by destabilizing the parameter estimates, */
/* increasing the risk of overfitting, confounding interpretation, and increasing computing time. */
/* One remedy for redundant variables is to group the related variables together. */
/* PROC VARCLUS finds groups of variables that are as correlated as possible within a group and as */
/* uncorrelated as possible with variables in other groups. A principal component analysis is run on */
/* the variables in each cluster and the second eigenvalue is checked. If it is greater than a given */
/* threshold (the MAXEIGEN= option), the cluster is split into two child clusters. This is repeated */
/* until the second eigenvalue of every cluster is at or below the threshold. */
/* After the original variables are split into clusters, the cluster components can replace the */
/* original variables as predictors in the regression model, or a representative variable that has */
/* a high correlation with its own cluster and a low correlation with the other clusters can be */
/* selected from each cluster. The representative variable is picked by the 1-R^2 ratio; a smaller */
/* ratio is better. */
/* 1-R^2 ratio = (1 - R^2 with own cluster) / (1 - R^2 with next closest cluster) */
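/* A small worked example of the ratio above, using made-up R^2 values (0.90 and 0.20 are */
/* hypothetical, not taken from the pva1 data). A variable strongly tied to its own cluster and */
/* weakly tied to the next closest one gets a small ratio. */
data _null_;
   own_rsq  = 0.90;   /* hypothetical R^2 with the variable's own cluster */
   next_rsq = 0.20;   /* hypothetical R^2 with the next closest cluster */
   ratio = (1 - own_rsq) / (1 - next_rsq);   /* 0.10 / 0.80 = 0.125 */
   put ratio=;
run;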
%let ex_inputs= MONTHS_SINCE_ORIGIN DONOR_AGE IN_HOUSE INCOME_GROUP PUBLISHED_PHONE
MOR_HIT_RATE WEALTH_RATING MEDIAN_HOME_VALUE MEDIAN_HOUSEHOLD_INCOME PCT_OWNER_OCCUPIED
PER_CAPITA_INCOME PCT_MALE_MILITARY PCT_MALE_VETERANS PCT_VIETNAM_VETERANS
PCT_WWII_VETERANS PEP_STAR RECENT_STAR_STATUS FREQUENCY_STATUS_97NK RECENT_RESPONSE_PROP
RECENT_AVG_GIFT_AMT RECENT_CARD_RESPONSE_PROP RECENT_AVG_CARD_GIFT_AMT RECENT_RESPONSE_COUNT
RECENT_CARD_RESPONSE_COUNT LIFETIME_CARD_PROM LIFETIME_PROM LIFETIME_GIFT_AMOUNT
LIFETIME_GIFT_COUNT LIFETIME_AVG_GIFT_AMT LIFETIME_GIFT_RANGE LIFETIME_MAX_GIFT_AMT
LIFETIME_MIN_GIFT_AMT LAST_GIFT_AMT CARD_PROM_12 NUMBER_PROM_12 MONTHS_SINCE_LAST_GIFT
MONTHS_SINCE_FIRST_GIFT ;
proc varclus data=pva1 short hi maxeigen=.7 plots=dendrogram;
var &ex_inputs mi_donor_age mi_income_group mi_wealth_rating;
run;
/* There are 23 clusters in total (splitting stops once every second eigenvalue is <= 0.7). */
/* Cluster 1 contains 5 variables, so its total variation is 5. The variation explained by the */
/* cluster component is 4.277752, which is about 85.56% of the total (4.277752 / 5). */
/* The 23 cluster components can be used as predictors directly, or a representative variable can */
/* be picked from each cluster based on the 1-R^2 ratio, e.g., months_since_first_gift for cluster 1. */
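/* A minimal sketch of one way to list the 1-R^2 ratios for the final cluster solution, so that a */
/* representative can be chosen per cluster. It reruns the same PROC VARCLUS step with the printed */
/* output suppressed and captures two ODS tables. The data set names work.cq and work.rsq are */
/* placeholders chosen for this example; the column names assume the usual layout of the */
/* ClusterQuality and RSquare ODS tables. */
ods select none;
ods output clusterquality=work.cq rsquare=work.rsq;
proc varclus data=pva1 short hi maxeigen=.7;
var &ex_inputs mi_donor_age mi_income_group mi_wealth_rating;
run;
ods select all;

/* The last row of the cluster quality table holds the final number of clusters. */
data _null_;
set work.cq;
call symputx('nclus', NumberOfClusters);
run;

/* Within each cluster, the variable with the smallest RSquareRatio is the candidate representative. */
proc print data=work.rsq noobs;
where NumberOfClusters=&nclus;
var Cluster Variable RSquareRatio;
run;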