Thursday, December 13, 2012

Cluster levels of Categorical variable to avoid over-fitting


Consider this context: target variable target_revenue is a continuous variable. The predictors include continuous variables like hist_visits, as well as categorical variable like best_leafnode_id, which has hundreds of levels.

If use this best_leafnode_id directly, we may fit the data well because of the many levels of predictor best_leafnode_id(since there are lots of levels, and there is an estimation for each level, so it's like we have more parameters, and we have more degree of freedom, and thus the fitting will be better). As is shown in the second graph below.

Because we input too many parameters, one potential problem is over-fitting. That is, we fit the training data well but it will not predict well on the validation dataset. 

From the second graph, if look at the training data, the reg7(reg with all levels of original data) has less MSE=83.95 which is less than reg6(regression with clustered levels) MSE=86.07. However, if we look at the validation data, we can see reg6(MSE=82.129) is less than reg7(MSE=105.655). That means if using the original data without clustering their levels, it will cause overfitting. It shows the clustered level method can avoid over-fitting. But we should choose a proper number of new levels. Not too small, not too large.

Below is the SAS code to cluster the huge level categorical variable: first calculate the mean of target_revenue at each level of best_leafnode_id, then bucket the levels of best_leafnode_id into less levels by the value of mean in the first step. Then we can format the old levels into new levels, the number of new levels is assigned by us. 


Pic01 -- SAS code to cluster levels





Pic02 -- Comparing the two models: 


No comments:

Post a Comment