Monday, December 30, 2013

R: how to debug the errors

Just some tips:

options(error=recover):  it will tell R to launch debug section and you can choose which one to debug


options(show.error.locations=TRUE):   let R show the source line number

Something else:

use traceback() to locate where the last error message is and then use browser() to run the function again to check what is wrong.

Tuesday, December 24, 2013

A note about what is the output in predict.lm(lm_model, test_data, type=”terms”):



## first explain what is type="terms":  If type="terms" is selected, a matrix of predictions 
## on the additive scale is produced, each column giving the deviations from the overall mean 
## (of the original data's response, on the additive scale), which is given by the attribute "constant".
 
set.seed(9999)
x1=rnorm(10)
x2=rnorm(10)
y=rnorm(10)
lmm=lm(y~x1+x2)
predict(lmm, data=cbind(x1,x2,y), type="terms")  
 
lmm$coefficient[1]+lmm$coefficient[2]*x1+mean(lmm$coefficient[3]*x2)-mean(y)-predlm[,1]
lmm$coefficient[1]+lmm$coefficient[3]*x2+mean(lmm$coefficient[2]*x1)-mean(y)-predlm[,2]

Monday, December 23, 2013

R: Calculate ROC and Plot ROC

library(ROCR)
library(Hmisc)
 
## calculate AUC from the package ROCR and compare with it from Hmisc
 
# method 1: from ROCR
data(ROCR.simple)
pred=prediction(ROCR.simple$prediction, ROCR.simple$labels)
perf=performance(pred, 'tpr', 'fpr')  #true positive and false negative
plot(perf, colorize=T)
 
perf2=performance(pred, 'auc')
auc=unlist(slot(perf2, 'y.values'))  # this is the AUC
 
# method 2: from Hmisc
rcorrstat=rcorr.cens(ROCR.simple$prediction, ROCR.simple$labels) 
rcorrstat[1]  # 1st is AUC, 2nd is Accuracy Ratio(Gini Coefficient, or PowerStat, or Somer's D)

Monday, September 9, 2013

hadoop on cloudera quickstart vm test example 01 wordcount

It's not easy to install hadoop and related items like hdfs hive and so on of your own, and it is more difficult to config them after installation.

Thanks to cloudera, we can test hadoop with its integrated tool kit (Cloudera QuickStart VM). it provides vmware, kvm and virtualbox edition to download. Everything is configured and you can test without any difficulty.

in this video, I show a example hot to do wordcount in hadoop. The youtube link is:


Steps not included in the video:
1: download vmware player or virtualbox and install;
2: download Cloudera QuickStart VM from cloudera(the link may change all the time, so you can google keyword "Cloudera QuickStart VM" to download).


The example of the wordcount test is:

1: install wget on centos server
sudo yum -y install wget
2: create test dir in /home/cloudera
mkdir /home/cloudera/test
cd /home/cloudera/test
3: create test txt file
echo "what can I do with hadoop on hadoop server or hive server" > test1.txt
4: put the txt file to hdfs
hdfs dfs -mkdir /user/cloudera/input
hdfs dfs -put /home/cloudera/test/test1.txt /user/cloudera/input/
5: go to  /usr/lib/hadoop-mapreduce/
cd /usr/lib/hadoop-mapreduce/
6: run the job
hadoop jar hadoop-mapreduce-examples.jar wordcount /user/cloudera/input/test1.txt /user/cloudera/output
7: check what are there in the output
hdfs dfs -ls /user/cloudera/output/
8: reat the output file
hdfs dfs -cat /user/cloudera/output/part-r-00000

Thursday, August 15, 2013

proc sql: include redundant vars in select statement with group by


This is to show: in proc sql group by statement, select should only contain group by vars and summary vars(vars to be summarized). Otherwise the result may be totally different from wanted.
 
data a;
input id a $2. value;
cards;
1 a 12
2 b 112
3 c 1121
1 a 3
2 b 23
3 c 15
1 a 16
;
run;
 
/* include both id and a in the select statement */
proc sql;
select id, a, sum(value) as v
from a
group by id;
quit;
 
/* include only id in the select statement */
proc sql;
select id, sum(value) as v
from a
group by id;
quit;
 
endsas;
 
/* output for include both id and a: there are duplicates */
      id  a          v
----------------------
       1  a         31
       1  a         31
       1  a         31
       2  b        135
       2  b        135
       3  c       1136
       3  c       1136
/* outpur for include only id */
      id         v
------------------
       1        31
       2       135
       3      1136

Wednesday, July 31, 2013

Fwd: Creating a Multidimensional Array for iris data


myiris=iris

names(myiris)=c("sl", "sw", "pl", "pw", "name")

with(myiris, cut(pl, breaks=quantile(pl, probs=seq(0,1, by=1/3)), include.lowest=TRUE))

plbin=with(myiris, cut(pl, breaks=quantile(pl, probs=seq(0,1, by=1/3)), include.lowest=TRUE))

pwbin=with(myiris, cut(pw, breaks=quantile(pw, probs=seq(0,1, by=1/3)), include.lowest=TRUE))

cbind(myiris, plbin, pwbin)->new

with(myiris, ftable(table(pwbin, plbin, name), row.vars=1:3))

## or    ftable(xtabs(~name+pwbin+pwbin,new))






Thursday, July 25, 2013

The difference between BY and CLASS in PROC MEANS

original post link:   https://communities.sas.com/thread/44285?start=0&tstart=0

 

CLASS and BY statements have similar effects but there are some subtle differences. In the documentation it says:

 

Comparison of the BY and CLASS Statements

Using the BY statement is similar to using the CLASS statement and the NWAY option in that PROC MEANS summarizes each BY group as an independent subset of the input data. Therefore, no overall summarization of the input data is available. However, unlike the CLASS statement, the BY statement requires that you previously sort BY variables. 
When you use the NWAY option, PROC MEANS might encounter insufficient memory for the summarization of all the class variables. You can move some class variables to the BY statement. For maximum benefit, move class variables to the BY statement that are already sorted or that have the greatest number of unique values. 
You can use the CLASS and BY statements together to analyze the data by the levels of class variables within BY groups.

 

Practically, this means that:

· The input dataset must be sorted by the BY variables. It doesn't have to be sorted by the CLASS variables.

· Without the NWAY option in the PROC MEANS statement, the CLASS statement will calculate summaries for each class variable separately as well as for each possible combination of class variables. The BY statement only provides summaries for the groups created by the combination of all BY variables.

· The BY summaries are reported in separate tables (pages) whereas the CLASS summaries appear in a single table.

· The MEANS procedure is more efficient at treating BY groups than CLASS groups.


options obs=10000;

libname temp "/data02/temp/temp_hsong/to_delete";

proc contents data=temp.high_vis_kws;
run;

proc sort data=temp.high_vis_kws out=high_vis_kws;
 by nrank day_of_week;
run;

proc summary data=high_vis_kws;
        by nrank day_of_week;
        var clicks visits;
        output out=classby1 sum=;
run;

proc summary data=high_vis_kws;
        class nrank day_of_week;
        var clicks visits;
        output out=classby2 sum=;
run;

title "using by";
proc print data=classby1 width=min;
run;

title "using class";
proc print data=classby2 width=min;
run;


Tuesday, April 30, 2013

Distribution of The Difference of Two Uniform Distribution Variable


Suppose U1~uniform(0,1), U2~uniform(0,1), and U1 is independent of U2, then what is the distribution of U1-U2?

The solution is in the picture attached. Then based on this, what is the distribution of U1+U2? The density should be of the same shape while it moves 1 unit to the right. The reason is: if U2 is uniform, then U3=1-U2 is uniform. so U1+U2 is the same as U1+U2=U1-U3+1. So it's density is the same as U1-U2 with i unit right transfer.


The answer can be verified in R. Density plot in R:
u1=runif(1000000,0,1)
u2=runif(1000000,0,1)
 
# Z is the difference of the two uniform distributed variable
z=u1-u2
plot(density(z), main="Density Plot of the Difference of Two Uniform Variable", col=3)
 
# X is the summation of the two uniform distributed variable
x=u1+u2
plot(density(x), main="Density Plot of the Summation of Two Uniform Variable", col=3)
Inline image 1
Inline image 1



Thursday, April 11, 2013

Read SAS data into R (SAS is required)

Those 3 ways can read SAS data into R. I prefer the first method since it gives the most choice to control.


### To read SAS data into R(SAS is necessary)
##1: first use SAS to export data to csv or dlm, then use R
libname test "/data/temp/hsong/test";
proc export data=test.hsb12 outfile="/data/temp/hsong/test/sasxport" dbms=dlm replace;
        delimiter=",";
run; 
# Then read in the exported data with R
read.table("/data/temp/hsong/test/sasxport", header=T)
 
##2: to read in with the package {Hmisc}
library(Hmisc) 
hsb12=sas.get(lib="/data/temp/hsong/test", mem="hsb12", as.is=T)
 
##3: read with library {foreign}, but I did not run it successfully
read.xport("path")

use scan to read in a piece of data for ad-hoc analysis

Sometimes it is necessary to read in a piece of data for ad-hoc analysis. We can save the data to txt and then read in by read.table. In this way we need to find physical path of the data. It is not convenient for ad-hoc analysis especially when data is in server and R is in local machine.
 
A convenient way is to use scan function in R to directly read in the data by pasting them. Like:
 
x=scan(what=(list(a1=0,a2=0)))
 

y=data.frame(x)


Then do any analysis like plot and so on based on df y.

Tips to incread the data reading speed in R

Here are some tips to increase the data reading speed in R:

#The purpose is to compare the processing time when using different options to read in data with read.table
# test to read /data02/temp/temp_hsong/test/hsb12.sas7bdat
  
#1: without any options 
system.time(read.table("/data02/temp/temp_hsong/test/sasxport.txt", header=T))
 
#2: By default R will convert character vars into factors wile reading
#suppress R convert character variables into factors by stringsAsFactors=F
system.time(read.table("/data02/temp/temp_hsong/test/sasxport.txt", header=T, stringsAsFactors=F))
 
#3: if data has no comment sign, then tell R
system.time(read.table("/data02/temp/temp_hsong/test/sasxport.txt", header=T, comment.char=''))
 
#4: roughly tell R a number slightly greater than the number of records
system.time(read.table("/data02/temp/temp_hsong/test/sasxport.txt", header=T, nrows=8000))
 
#5: When reading data, it's better tell R the mode of each var by colClasses
system.time(read.table("/data02/temp/temp_hsong/test/sasxport.txt", header=T, sep=',', colClasses=c(rep("numeric",11))))

The processing time is:

Monday, April 1, 2013

The possible reason getting irregular coefficient(like we think it is positive, but it shows negetive)

Sometimes, we will get irregular coefficient: consider sell_amount vs sell_price, we should expect a negative coefficient because as the price increases, the selling amount should become less from common sense. But sometimes it may give a positive coefficient.

 

The possible reason is we put high correlated variables in the predictors: suppose we have both sell_price and product_price in the model, and we know product_price is highly correlated to sell_price. An example is shown below in example 1.

 

Another  possible reason is we recode the missing data: if missing data is at the left, but we recode it to the right(like if x<0 we recode x=99999 which is a very big number).

 

Example 1:

# set up random number seed
set.seed(1000)
 
# generate 100 x with x ~ N(5,1)
x=rnorm(100,5,1)
 
 
# Y=5*X
y=5*x+rnorm(100,0,1)
 
 
 
# X1=100*X
x1=x*100+rnorm(100,0,.2)
 
 
 
1)it is reasonable that the regression coefficient is 4.947(while the true value is 5). 
#relation between x and y
lm(y~x)
 
 
2) because x1=100*x, the coefficient for x1 should be 1/100 of above, as below:
# relation between x1 and y
lm(y~x1)
 
 
3) but if we put both x and x1 in the model, then we could not get the positive coefficient for both x and x1. This is because of multicollinearity. 
# put both x and x1 as the predictor
lm(y~x+x1)
 
 
 

Example 2:

4) another condition is from we recode missing data. Suppose if x<=4 is missing, and we recode missing as 99999.
# if x<=4, treat it as 99999
x3=ifelse(x>4,x,99999)
 
 
5) Then the regression coefficient is negative because of the 99999.
# relation between x3 and y
lm(y~x3)
 

 

The recode issue can be treated as a special case of non-linear relation.

Thursday, March 28, 2013

group obs almost evenly and calculate cumulative stats in SAS and R

The input is 99 records, the requirement is to split it into 10 groups as evenly as possible(10 records each for first 9 groups, and 9 records in the last group). And then get the cumulative sum/mean in each group.

 

 

The output is like:

 

 

In SAS, a loop is required to do this because of cumulative sum.

 

data test;

     i=1;

     p=.99;

     output;

     do i=2 to 98;

           p=.8;

           output;

     end;

     i=99;

     p=.2;

     output;

run;

 

proc print data=test;

run;

 

* if use proc rank, it cannot group data evenly because of ties;

proc rank data=test out=t group=10;

     var p;

     ranks rank;

run;

 

proc print data=t;

run;

 

* so need to mannuly do it;

%let dsid=%sysfunc(open(test));          *open the file;

%let nobs=%sysfunc(attrn(&dsid,nobs));   *count the obs in file; %let ngroup=10; %let overall_pct=.5; %put &nobs;

 

* data n_per_group only has one obs;

data n_per_group;

     n_per_grp=int(&nobs/&ngroup.);  * get quotient;

     remainder=mod(&nobs,&ngroup.);  * get remainder;

     array ps {&ngroup} ps1-ps&ngroup;

     keep ps1-ps&ngroup;

     do i=1 to &ngroup;

           if remainder>0 then do;

                     ps{i}=n_per_grp+1;

                     remainder=remainder-1;

           end;

           else ps{i}=n_per_grp;

     end;

     output;

run;

 

proc print data=n_per_group;

run;

 

* read in  the only one obs, and keep it in PVM until the end by using if _n_=1 then do statement;

data out(drop=freq _count_ i p);

     if _n_=1 then do;

           set n_per_group;

           index=1;

     end;

    

     retain freq  _count_ 0 index ;

    

     array ps(&ngroup) ps1-ps&ngroup;

    

     set test end=last;

    

     * a liitle tricky: keep on adding p together unitl the # of added obs = n_per_group as expected;

     * if the # of added obs = n_per_group, calculate the stats we want, otherwise, keep on adding;

     if _count_=ps(index) then do;

           num_obs=ps(index);

           avg_pred_p=sum_p/num_obs;

           lift=avg_pred_p/&overall_pct;

           output;

           index+1;

           _count_=0;

           sum_p=0;

     end;

    

     sum_p+p;

     _count_+1;

    

     if last then do;

           num_obs=ps(index);

           avg_pred_p=sum_p/num_obs;

           lift=avg_pred_p/&overall_pct;

           output;

     end;

run;

 

proc print data=out;

run;

 

 

## It is very easy to do this in R

##  a simple way

rm(list=ls())

x=c(.9,rep(.8,97),.2)

ngrp=10

nobs=rep(length(x)%/%ngrp, ngrp)+c(rep(1,length(x)%%ngrp), rep(0,ngrp-length(x)%%ngrp))

levl=rep(1:ngrp, nobs)

df=data.frame(cbind(x,levl))

aggregate(x~levl, df, mean)