Thursday, December 13, 2012

Cluster levels of Categorical variable to avoid over-fitting

Consider this context: target variable target_revenue is a continuous variable. The predictors include continuous variables like hist_visits, as well as categorical variable like best_leafnode_id, which has hundreds of levels.

If use this best_leafnode_id directly, we may fit the data well because of the many levels of predictor best_leafnode_id(since there are lots of levels, and there is an estimation for each level, so it's like we have more parameters, and we have more degree of freedom, and thus the fitting will be better). As is shown in the second graph below.

Because we input too many parameters, one potential problem is over-fitting. That is, we fit the training data well but it will not predict well on the validation dataset. 

From the second graph, if look at the training data, the reg7(reg with all levels of original data) has less MSE=83.95 which is less than reg6(regression with clustered levels) MSE=86.07. However, if we look at the validation data, we can see reg6(MSE=82.129) is less than reg7(MSE=105.655). That means if using the original data without clustering their levels, it will cause overfitting. It shows the clustered level method can avoid over-fitting. But we should choose a proper number of new levels. Not too small, not too large.

Below is the SAS code to cluster the huge level categorical variable: first calculate the mean of target_revenue at each level of best_leafnode_id, then bucket the levels of best_leafnode_id into less levels by the value of mean in the first step. Then we can format the old levels into new levels, the number of new levels is assigned by us. 

Pic01 -- SAS code to cluster levels

Pic02 -- Comparing the two models: 

Sunday, December 9, 2012

R: Replicate plot with ggplot2 (part 4) bar chart

The original plot from UCLA ATS is here. Here the same plot is done in package ggplot2.

Let's read in the data:
The bar chart here is to show how many observations for each level of ses(=1, 2 and 3 separately, will be replaced by "low", "median" and "high" separately).

The first plot is the plain bar chart:

The second one is to replace the original value of ses by "low", "median" and "high" separately:

The third one is to change the width of the bin.

The forth is to group the data by female first, and then bin on each level of female(0 or 1). This is stacked plot. From the plot it is not easy to figure out which is higher. So we need to plot it separately as is shown in the fifth graph.

The fifth one is plot bar chart at each level of female. That is, plot two bar chart for ses at 0 and 1 level of females separately.

R: Replicate plot with ggplot2 (part 3) boxplot

This is to replicate the boxplot from UCLA ATS: Examples of box plots

First read in the data:

The first is to draw the boxplot with no features

The second is filling the box area with blue color.

The third is the box plot splited by a categorical variable ses:

Next is to rename the levels of the categorical variable:

The fifth one is to set notch = TRUE

The sixth is the plot on the interaction levels of more than one categorical variable:

A little more, if we want to add outliers on the plot, it should be like:

Saturday, December 8, 2012

R: Replicate plot with ggplot2 (part 2) histogram, emprtical curve, normal density curve

The first one is to replicate the scatter plot here. This part is to replicat the plot of histogram.

The original UCLA ATS like is:

In ggplot2, geom_histogram is used to draw histogram.

The data is:

The first one is histogram with black fill:

The output graph is:

ggplot2 has more choice that you can fill in the color by the counts in each bin. like

Next is to change the binwidth. In ggplot2 this can be done by binwidth or by breaks. Here is shown by binwidth:

Next is to change from frequency to density(percentage in fact). This can be done by assigning y to be ..density.. in ggplot2:

Then is to add normal density and empirical curve in the plot.

First is the empirical curve:

Next is the Normal density with mean and std are from the generated data:

The last one is to add the counts on top of each bin. I did not figure out how to do it. It should be easy by adding text. But unitl now my process has not reached there.

Thursday, December 6, 2012

R: check variables with missing data and how many missing(NA) data in a dataframe

It's easy to check with apply function:
seo_rev=read.csv("D:\\hsong\\SkyDrive\\Public\\seo_kw_rev_train.csv", header=T, skip=2, na.string='.')
lapply(seo_rev, class)
## check which variables have missing value, and how many of the values are missing
getna<-function(x) sum(>0)
apply(seo_rev, 2, getna)
The output will list each variable's name with how many obs are missing for that variable.

Thursday, November 29, 2012

explore data with ggplot2

The purpose here is to explore the relation between variable actrpv(which is the actual value) vs lprpv(which is the descriptive data). We think there should be some relation between these two. Here use the graph to explore it.

rpvdata<-read.csv("D:\\SkyDrive\\Public\\landing_page_rpv\\landing_page_rpv.csv",header=T, skip=2)
sapply(rpvdata, class)
p<-ggplot(rpvdata, mapping=aes(x=lprpv, y=actrpv))
plot(lprpv, actrpv, col="blue", main="plot of lprpv vs actrpv")

It does not show any clear relation between these two variables from the graph above.

To make it a little clear, the data are grouped into 20 groups based on the percentiles of lprpv(5%, 10%, ..., 100%). And calculate t he weighted mean at each group weighted by variable visits_wt. It is shown the average of lprpv and average of actrpv are somehow linear related.

## bucket data into 20 buckets like sas proc rank did
lprpv_rank<-cut(lprpv, breaks=quantile(lprpv, probs=c(0:20/20)), labels=0:19, include.lowest=T, right=F)
## calculate mean of lprpv in each lprpv_rank
tapply(lprpv, lprpv_rank, mean)
## calculate weighted mean
df_wtm<, actrpv, lprpv_rank, visits_wt))
lprpv_m<-ddply(df_wtm, .(lprpv_rank), function(x) data.frame(lprpv_mean=weighted.mean(x$lprpv, x$visits_wt)))
actrpv_m<-ddply(df_wtm, .(lprpv_rank), function(x) data.frame(actrpv_mean=weighted.mean(x$actrpv, x$visits_wt)))
overall_wt_m<-merge(lprpv_m, actrpv_m, by="lprpv_rank")
## plot of the data with smoothed curve
ggplot(overall_wt_m, aes(x=lprpv_mean, y=actrpv_mean))+geom_point()+stat_smooth()
## linear regression without product_count information
summary(lm(actrpv_mean ~ lprpv_mean, data=overall_wt_m))

From the graph above, it shows a linear relation if we ignore the first several points with low value of lprpv_mean.

The linear regression result is given below. The adjusted R square is about .3936, which means the regression is not good. Usually R^2 > .8 means the regression is good.

Next we involve another variable called product_count, which is the numer of products on each page. If it is 1, means there is only 1 product on that page. In this condition, we should expect lprpv and actrpv are closely related.

### next include prod_cnt_flag in the analysis
## generate flag variables like sas format did
cutf=function(x) cut(x, breaks=quantile(x, probs=c(0:20/20)), labels=0:19, include.lowest=T, right=F)
prod_cnt_flag<-ifelse(product_count==0, 0, ifelse(product_count==1,1,ifelse(product_count==2,2,ifelse(product_count==3,3,4))))
for (i in 1:4){
  lprpv_rank<-cut(lprpv[product_count==i], breaks=quantile(lprpv[product_count==i], probs=c(0:20/20)), labels=0:19, include.lowest=T, right=F)
  df_wtm<[product_count==i], actrpv=actrpv[product_count==i], lprpv_rank, visits_wt=visits_wt[product_count==i]))
  lprpv_m<-ddply(df_wtm, .(lprpv_rank), function(x) data.frame(lprpv_mean=weighted.mean(x$lprpv, x$visits_wt)))
  actrpv_m<-ddply(df_wtm, .(lprpv_rank), function(x) data.frame(actrpv_mean=weighted.mean(x$actrpv, x$visits_wt)))
  overall_wt_m<-merge(lprpv_m, actrpv_m, by="lprpv_rank")
  overall_wt_m=data.frame(cbind(overall_wt_m, prod_cnt_group=i))
## or plot with ggplot2, concatenate the list to a long data frame, overall_m))
p<-ggplot(df, mapping=aes(lprpv_mean, actrpv_mean))
p+geom_point()+stat_smooth()+facet_wrap(~prod_cnt_group, ncol=1)

Now the data in the first group(top one) is more linear then before as is explained. For the other three groups they also linearly perform better than without group condition.

From this study, we should put the variable prod_cnt_group in the regression as a categorival variable. It will help to improve the model fitting.

key point: 1)how to calculate weighted mean 2)how to aggregate in each level of a factor, this can be done by tapply. But here is done by library(plyr). 3)how to draw separately for each factor level in ggplot2 with facet_wrap.

Wednesday, November 28, 2012

R: Replicate plot with ggplot2

The purpose is to replicate theose scatter plot from ucla ats with ggplot2. The original plots from ucla ats is:  Scatter plot

ggplot2 has less to remember than the plot in R base. Generally, it has those important concepts:

1: mapping and scale: mapping the data to plot attributes, like map data to x, y or colour, group and so on.

2: geometric object: plot points, lines, or curves, histogram, bar chart and so on

3: statistics: like regression, smoothing, or statistics

4: coordinate: choose data to show

5: layer and facet: group the data in different plots

Now I will show how to draw the graphs in the website above.

## to replicate:
hsb2 <- read.table('', header=T, sep=",")
p<-ggplot(hsb2, aes(x=math, y=write))
## graph1
## graph2

## graph3_4
(This graph 3-5 is different from the plots on ucla ats. It looks like there are some issues about their plot since if look at the two bigges math value for male, they are 75. But in their third plot this is lost. I have sent email to them to double check.)
## graph5
ggplot(hsb2, aes(x=math, y=write, colour=factor(female), shape=factor(female))) + geom_point() + geom_smooth(method="lm", fill=NA)

Monday, November 19, 2012

Simulation for An Interview Question

I was asked this question last year, and saw someone post it in misbbs yesterday.

The question is: A and B toss a dice by turns, the person who first gets 6 will win. Now A toss it first. What is the probability of A winning?

The answer should be: P(A)=6/11 and P(B)=5/11. We need to use conditional probability to solve it.

Solution: for first run, if A get 6, prob is 1/6. If A fails, prob is 5/6; then B get 6 is 1/6. So the prob of B win is A fails and B gets 6, which is 5/6*1/6=5/36.

For the second run, both A B fials in the first run, then A get 6, it is 5/6*5/6*1/6. For B, it is 5/6*5/6*5/6*1/6.

And so on.

So final A wins is: 1/6 + 5/6*5/6*1/6 * (5/6*5/6)^2*1/6 +... = (1/6)/(1-5/6*5/6) = 6/11.

Or we can see every time P(A) = 5/6 * P(B) and P(A) + P(B)=1. So P(A)=6/11.

It can be verified by this small program:

### veiry the dice toss question



for (jj in 1:1000000){
while (s != 6){

sum(rec %% 2 ==1)/1000000

The result is 0.545172 which is close to 6/11=0.5454545.


Saturday, November 10, 2012

R: linear regression diagnostic -- 1: Outlieers, Leverage and Influence data detect

The original one by SAS is here UCLA ATS. It focus on how to detect outliers, leverage and influences before doing any regression.

Here it will be replicated in R.

1: read in data and get summary info:

## since there is no data directly, I will use the spss data file 


2: Scatter plot of crime pctmetro poverty and single


here for convenience, convert the column names from upcase to lowcase.

From the scatter plot, it shows some points are far away from the others.

Outlier: if an independent variable has large residual, then it is an outlier;
Leverage: if an dependent variable(predictor) has an obs far away from the others, then it is called an leverage.
Influence: both outlier and leverage

3: plot of crime vs pctmetro, and label the plot points by another variable


From the plot it is clear to see the obs for DC might be an outlier.

Similarly the plots can be drawn for the other pairs of variables.

4: regression and diaglostics with statistical methods


The value R square=.8399, which means 83.99% of variation of data are explained by the model. Usually this value > .8 means the model is pretty good.

5: some important values

In statistics class, we have learned some statistics vocabularies like: student residual, leverage(diag of hat matrix), cook's distance and so on. In R it's easy to get them:

## calculate leverage, student residual, and cook's distance


## plot of leverage vs student residual
plot(r_std^2, leverage)
text(r_std^2,leverage, crime$state, cex=.5, pos=2)

It shows DC and MS have high residuals from the plot. DC has high leverage. So DC should be put more attention.

6: DFBETAS: it's calculated for each obs of all the dependent variables, which meaning how the regression coefficient will change if removing the obs.


The meaning: take an example for obs AK and var single, if include obs AK in the data, the regression coefficient will be increased by 0.0145*standard_error_of_single_coefficient=.0145*15.5=2.248.

Let's check this:


7: Plot all debetas on the same table to compare

text(dfb$single, crime$states,cex=.5,pos=2)

In all, the 51_st point(DC) shall catch our attention.

The possible reason is, all others are data for each state but DC is not a state.

8: Finally is the regression without obs from DC


Monday, November 5, 2012

R: how to draw added-variable plot (partial-regression plot)

Partial regression is very helpful in detecting influential points in multiple regression.  The one in SAS and introducion of added variable plot is here:

Here is how to do it in R with library(car):



## change upcase column names to lowcase



The output graph is given below. It is the graph of residuals of crime vs residuals of single while both crime and single are adjusted by the other variables(pctmetro+pctwhite+poverty). From the graph it shows DC should be taken care of. AK and WV are also the points that may be influential points.

Monday, October 22, 2012

Pick the latest code version by the version number

This question is from mitbbs. The original question is for CS.

 The question is: given a series of version number, how to pick the latest version code? For example, given "1.3.8", "1.23.2", "1.9.15", "2.1", "0.99.99", "", how to pick the latest code? (it should be "2.1" here in the example).  In SAS, we can easily split the version number into cells(use array) and then fill up each cell with 0(use format) to compare.

 The code is like:

data a;
input a & $100.;

proc print data=a;

data b;
        set a;
        array a_split[*] a_1-a_10;
        do i=1 to dim(a_split);

data b;
        set b(drop=i);
        format _numeric_ z10.;

proc sort data=b;
        by a_1--a_10;

proc print data=b width=min;

The final output is:

Thursday, October 11, 2012

Linux and SAS time set up

When kick off jobs for others, it requires to use first day and last day of last month as the parameter. If run sas script file in cron job and pick the time using linux command:

fst_day_this_mon=$(date -d "$date" +01%b%Y)
lst_day_last_mon=$(date -d "-1 day $fst_day_this_mon" +%d%b%Y)

echo fst_day_this_mon=$fst_day_this_mon  lst_day_last_mon=$lst_day_last_mon

In SAS we can use intnx function:

## in sas we can use function intnx with optional argument 'B' 'E'
## data _null_;
##      x=today();
##      stdt=intnx('month',x,-1,'B');
##      eddt=intnx('month',x,-1,'E');
##      call symput('stdt', put(stdt,date9.));
##      call symput('eddt', put(eddt,date9.));
## run;
## %put stdt=&stdt  eddt=&eddt;

Another similar code to pick the first day is:

d1=$(date -d "-1 month -$(($(date +%d)-1)) days" +"%d%b%Y")
d2=$(date -d "+13 day $d1" +"%d%b%Y")
d3=$(date -d "$d1+14days" +"%d%b%Y")
d4=$(date -d "$d1+27days" +"%d%b%Y")

echo "$d1" "$d2" "$d3" "$d4"

Wednesday, October 3, 2012

Rename all the variables in a data set

Some times it is necessary to rename all the variables. For example, it may be more convenient to use in array part.

An example to rename all the variables is:

data a;
input a b c d e f g h i j k l m n o p q;
1 2 3 4 5 6 7 1 2 3 4 5 6 7 1 2 3

proc contents data=a out=varnames;

proc print data=varnames;

filename pre "./pre.txt";

data _null_;
 file pre;
        set varnames end=eof;
 if _n_=1 then put "rename ";
        put name '= ' newvarname;
 if eof then put ';';

data b;
        set a;
        %include pre;

proc print data=b;

Thursday, September 6, 2012

Study Notes3 Predictive Modeling with Logistic Regression: group redundant variables

/* Redundant variables may degrade the analysis by: destabilize the estimation, overfitting, confounding*/
/* interpretation, and more computing time                                                              */
/* one solution for redundant variable is grouping the related variables                                */
/* proc varclus will find groups of variables that are as correlated as possible within groups and as   */
/* uncorrelated as possible with variables in other groups. Priciple component is done on the variables */
/* in each cluster by checking the Second Eigen Value. If it is greater than a given value, the cluster */
/* will be split into two child cluster. It's repeated until the SEV is less than the threshold         */

/* After the original variables are splited into different clusters, those clusters can be used to      */
/* replace the original variables as the predictors in the regression model, or a representative var    */
/* having high corelation with its own cluster and low correlation with the other clusters can be       */
/* selected from each cluster. The representative variable is picked by the 1-R^2 ratio                 */
/* 1-R^2 ratio = (1-R^2 of own cluster)/(1-R^2 of next cluster)                                         */


proc varclus data=pva1 short hi maxeigen=.7 plots=dendrogram;
      var &ex_inputs mi_donor_age mi_income_group mi_wealth_rating;

/* There are 23 clusters totally(based on the second egien value <= .7)                               */
/* Cluster one there are 5 variables, so total variation is 5. Variation explained by this cluster is */
/* 4.277752, which is about 85.56% of total variation. */

/* Those 23 clusters can be used as predictors directly, or a representative variable can be picked   */
/* from each cluster based on the 1-R^2 ratio. e.g., months_since_first_gift will represent cluster 1 */