## Monday, December 30, 2013

### R: how to debug the errors

options(error=recover): tells R to launch an interactive debugging session when an error occurs; you can choose which call frame to inspect.

options(show.error.locations=TRUE): makes R show the source line number where the error occurred.

Something else:

Use traceback() to locate where the last error came from, then insert browser() and run the function again to check what is wrong.
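As a minimal sketch (the function `f` and its bad input are made up for illustration), the usual workflow is: the error message points at the failing call, traceback() shows the call stack, and browser() or options(error=recover) lets you step through the frames:

```r
# Hypothetical failing function, just for illustration
f <- function(x) {
  if (!is.numeric(x)) stop("x must be numeric")
  sqrt(x)
}

# show source locations in error messages
options(show.error.locations = TRUE)

# capture the error message (at the console you would just see the error,
# then call traceback() to locate it)
msg <- tryCatch(f("a"), error = function(e) conditionMessage(e))
print(msg)

# to debug interactively: options(error = recover) and rerun f("a"),
# then pick a frame; or insert browser() inside f and call it again
```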

## Tuesday, December 24, 2013

### A note about the output of predict.lm(lm_model, test_data, type="terms"):

```
## First, what type="terms" means: if type="terms" is selected, a matrix of predictions
## on the additive scale is produced, each column giving the deviations from the overall
## mean (of the original data's response, on the additive scale), which is given by the
## attribute "constant".
set.seed(9999)
x1=rnorm(10)
x2=rnorm(10)
y=rnorm(10)
lmm=lm(y~x1+x2)
## note: predict.lm takes test data through newdata=, not data=
predlm=predict(lmm, newdata=data.frame(x1,x2), type="terms")
## each column is the centered contribution of one term, so both checks below are ~0:
coef(lmm)[1]+coef(lmm)[2]*x1+mean(coef(lmm)[3]*x2)-mean(y)-predlm[,1]
coef(lmm)[1]+coef(lmm)[3]*x2+mean(coef(lmm)[2]*x1)-mean(y)-predlm[,2]
```

## Monday, December 23, 2013

### R: Calculate ROC and Plot ROC

```
library(ROCR)
library(Hmisc)
## calculate AUC from the package ROCR and compare with it from Hmisc
# method 1: from ROCR
data(ROCR.simple)
pred=prediction(ROCR.simple$prediction, ROCR.simple$labels)
perf=performance(pred, 'tpr', 'fpr') # true positive rate vs false positive rate
plot(perf, colorize=T)
perf2=performance(pred, 'auc')
auc=unlist(slot(perf2, 'y.values')) # this is the AUC
# method 2: from Hmisc
rcorrstat=rcorr.cens(ROCR.simple$prediction, ROCR.simple$labels)
rcorrstat[1] # 1st is AUC (C index), 2nd is Accuracy Ratio (Gini coefficient, Power Stat, or Somers' D)
```
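For intuition, AUC can also be computed directly from its rank-statistic definition: the probability that a randomly chosen positive scores above a randomly chosen negative. A small sketch on simulated scores and labels (the data here are made up, not ROCR.simple, so it runs without either package):

```r
set.seed(1)
labels <- rbinom(200, 1, 0.5)
scores <- labels + rnorm(200)   # positives tend to score higher

pos <- scores[labels == 1]
neg <- scores[labels == 0]
# P(score_pos > score_neg), counting ties as 1/2
auc_manual <- mean(outer(pos, neg, ">") + 0.5 * outer(pos, neg, "=="))
print(auc_manual)
```

This should match the value the packages report when run on the same scores and labels.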

## Monday, September 9, 2013

### hadoop on cloudera quickstart vm test example 01 wordcount

Thanks to Cloudera, we can test Hadoop with its integrated toolkit (Cloudera QuickStart VM). It is available in VMware, KVM, and VirtualBox editions. Everything is preconfigured, so you can test without any difficulty.

In this video, I show an example of how to do wordcount in Hadoop. The YouTube link is:

Steps not included in the video:

1: download vmware player or virtualbox and install;

2: download the Cloudera QuickStart VM from Cloudera (the link changes over time, so you can google "Cloudera QuickStart VM" to find the download).

The example of the wordcount test is:

1: install wget on centos server

sudo yum -y install wget

2: create test dir in /home/cloudera

mkdir /home/cloudera/test

cd /home/cloudera/test

3: create test txt file

echo "what can I do with hadoop on hadoop server or hive server" > test1.txt

4: put the txt file to hdfs

hdfs dfs -mkdir /user/cloudera/input

hdfs dfs -put /home/cloudera/test/test1.txt /user/cloudera/input/

5: go to /usr/lib/hadoop-mapreduce/

cd /usr/lib/hadoop-mapreduce/

6: run the job

hadoop jar hadoop-mapreduce-examples.jar wordcount /user/cloudera/input/test1.txt /user/cloudera/output

7: check what are there in the output

hdfs dfs -ls /user/cloudera/output/

8: read the output file

hdfs dfs -cat /user/cloudera/output/part-r-00000

## Thursday, August 15, 2013

### proc sql: include redundant vars in select statement with group by

## Wednesday, July 31, 2013

### Fwd: Creating a Multidimensional Array for iris data

## Thursday, July 25, 2013

### The difference between BY and CLASS in PROC MEANS

CLASS and BY statements have similar effects but there are some subtle differences. In the documentation it says:

**Comparison of the BY and CLASS Statements**

Using the BY statement is similar to using the CLASS statement and the NWAY option in that PROC MEANS summarizes each BY group as an independent subset of the input data. Therefore, no overall summarization of the input data is available. However, unlike the CLASS statement, the BY statement requires that you previously sort BY variables.

When you use the NWAY option, PROC MEANS might encounter insufficient memory for the summarization of all the class variables. You can move some class variables to the BY statement. For maximum benefit, move class variables to the BY statement that are already sorted or that have the greatest number of unique values.

You can use the CLASS and BY statements together to analyze the data by the levels of class variables within BY groups.

Practically, this means that:

- The input dataset must be sorted by the BY variables; it doesn't have to be sorted by the CLASS variables.

- Without the NWAY option in the PROC MEANS statement, the CLASS statement calculates summaries for each class variable separately as well as for each possible combination of class variables. The BY statement only provides summaries for the groups created by the combination of all BY variables.

- The BY summaries are reported in separate tables (pages), whereas the CLASS summaries appear in a single table.

- The MEANS procedure is more efficient at processing BY groups than CLASS groups.

```
options obs=10000;
libname temp "/data02/temp/temp_hsong/to_delete";

proc contents data=temp.high_vis_kws;
run;

proc sort data=temp.high_vis_kws out=high_vis_kws;
  by nrank day_of_week;
run;

proc summary data=high_vis_kws;
  by nrank day_of_week;
  var clicks visits;
  output out=classby1 sum=;
run;

proc summary data=high_vis_kws;
  class nrank day_of_week;
  var clicks visits;
  output out=classby2 sum=;
run;

title "using by";
proc print data=classby1 width=min;
run;

title "using class";
proc print data=classby2 width=min;
run;
```

## Tuesday, April 30, 2013

### Distribution of the Difference of Two Uniform Random Variables

```
u1=runif(1000000,0,1)
u2=runif(1000000,0,1)
# Z is the difference of the two uniformly distributed variables
z=u1-u2
plot(density(z), main="Density Plot of the Difference of Two Uniform Variables", col=3)
# X is the sum of the two uniformly distributed variables
x=u1+u2
plot(density(x), main="Density Plot of the Sum of Two Uniform Variables", col=3)
```
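The density of Z = U1 - U2 is in fact triangular, f(z) = 1 - |z| on [-1, 1]. A quick numerical check of the simulation against that formula (sample size reduced here for speed):

```r
set.seed(1)
u1 <- runif(1e5)
u2 <- runif(1e5)
z <- u1 - u2

# theoretical P(|Z| < 0.1) under the triangular density f(z) = 1 - |z|
theo <- integrate(function(t) 1 - abs(t), -0.1, 0.1)$value  # 0.19
emp  <- mean(abs(z) < 0.1)
print(c(theoretical = theo, empirical = emp))
```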

## Thursday, April 11, 2013

### Read SAS data into R (SAS is required)

1: First use SAS to export the data to csv or dlm, then read it with R:

```
libname test "/data/temp/hsong/test";
proc export data=test.hsb12 outfile="/data/temp/hsong/test/sasxport" dbms=dlm replace;
  delimiter=",";
run;
```

Then read the exported data with R:

```
read.table("/data/temp/hsong/test/sasxport", header=T)
```

2: Read it directly with the package {Hmisc}:

```
library(Hmisc)
hsb12=sas.get(lib="/data/temp/hsong/test", mem="hsb12", as.is=T)
```

3: Read it with the package {foreign}, but I did not run it successfully:

```
read.xport("path")
```

### use scan to read in a piece of data for ad-hoc analysis

A convenient way is to use the scan function in R to directly read in the data by pasting it. Like:

```
x=scan(what=list(a1=0,a2=0))
y=data.frame(x)
```
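scan() with no file argument reads pasted input from the console, so it is interactive; for a reproducible sketch, the same call can read from a text connection instead:

```r
# two columns a1 and a2; scan reads the fields alternately into the list
txt <- "1 10
2 20
3 30"
x <- scan(textConnection(txt), what = list(a1 = 0, a2 = 0))
y <- data.frame(x)
print(y)
```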

### Tips to increase the data reading speed in R

The purpose is to compare the processing time when using different options to read in data with read.table.

```
# test file: /data02/temp/temp_hsong/test/sasxport.txt
# (exported from hsb12.sas7bdat)

# 1: without any options
system.time(read.table("/data02/temp/temp_hsong/test/sasxport.txt", header=T))

# 2: by default R converts character vars into factors while reading;
# suppress this with stringsAsFactors=F
system.time(read.table("/data02/temp/temp_hsong/test/sasxport.txt", header=T, stringsAsFactors=F))

# 3: if the data has no comment sign, tell R
system.time(read.table("/data02/temp/temp_hsong/test/sasxport.txt", header=T, comment.char=''))

# 4: give R a number slightly greater than the number of records
system.time(read.table("/data02/temp/temp_hsong/test/sasxport.txt", header=T, nrows=8000))

# 5: when reading data, it is better to tell R the mode of each var via colClasses
system.time(read.table("/data02/temp/temp_hsong/test/sasxport.txt", header=T, sep=',', colClasses=c(rep("numeric",11))))
```
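A self-contained sketch of the same idea on a temporary file (the data are generated here, since the sasxport.txt path above is site-specific): the call with hints should return the same data frame as the default call, just faster on large files:

```r
tmp <- tempfile(fileext = ".csv")
write.csv(data.frame(a = 1:1000, b = rnorm(1000)), tmp, row.names = FALSE)

d_default <- read.table(tmp, header = TRUE, sep = ",")
d_hinted  <- read.table(tmp, header = TRUE, sep = ",",
                        colClasses = c("integer", "numeric"),
                        nrows = 1000, comment.char = "")
# same result either way
print(all.equal(d_default, d_hinted))
```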

## Monday, April 1, 2013

### Possible reasons for getting an irregular coefficient (like we think it should be positive, but it shows negative)

One possible reason is that we put highly correlated variables in the predictors: suppose we have both sell_price and product_price in the model, and product_price is highly correlated with sell_price. This is shown in example 1 below.

Another possible reason is how we recode missing data: the missing data may sit at the left of the distribution, but we recode it to the right (like if x<0 we recode x=99999, which is a very big number).

*Example 1:*

```
# set up random number seed
set.seed(1000)
# generate 100 x with x ~ N(5,1)
x=rnorm(100,5,1)
# Y=5*X
y=5*x+rnorm(100,0,1)
# X1=100*X
x1=x*100+rnorm(100,0,.2)
```

1) It is reasonable that the regression coefficient is 4.947 (while the true value is 5).

```
# relation between x and y
lm(y~x)
```

2) Because x1=100*x, the coefficient for x1 should be 1/100 of the above:

```
# relation between x1 and y
lm(y~x1)
```

3) But if we put both x and x1 in the model, we cannot get a positive coefficient for both x and x1. This is because of multicollinearity.

```
# put both x and x1 as predictors
lm(y~x+x1)
```

**Example 2:**

4) Another condition comes from recoding missing data. Suppose x<=4 is missing, and we recode missing as 99999.

```
# if x<=4, treat it as 99999
x3=ifelse(x>4,x,99999)
```

5) Then the regression coefficient is negative because of the 99999.

```
# relation between x3 and y
lm(y~x3)
```

The recoding issue can be treated as a special case of a non-linear relation.

## Thursday, March 28, 2013

### group obs almost evenly and calculate cumulative stats in SAS and R

**The input is 99 records; the requirement is to split it into 10 groups as evenly as possible (10 records each for the first 9 groups, and 9 records in the last group), and then get the cumulative sum/mean in each group.**

**In SAS, a loop is required to do this because of the cumulative sum.**

```
data test;
  i=1;
  p=.99;
  output;
  do i=2 to 98;
    p=.8;
    output;
  end;
  i=99;
  p=.2;
  output;
run;

proc print data=test;
run;

* if we use proc rank, it cannot group the data evenly because of ties;
proc rank data=test out=t group=10;
  var p;
  ranks rank;
run;

proc print data=t;
run;

* so we need to do it manually;
%let dsid=%sysfunc(open(test));          * open the file;
%let nobs=%sysfunc(attrn(&dsid,nobs));   * count the obs in the file;
%let ngroup=10;
%let overall_pct=.5;
%put &nobs;

* data n_per_group has only one obs;
data n_per_group;
  n_per_grp=int(&nobs/&ngroup.);    * get quotient;
  remainder=mod(&nobs,&ngroup.);    * get remainder;
  array ps {&ngroup} ps1-ps&ngroup;
  keep ps1-ps&ngroup;
  do i=1 to &ngroup;
    if remainder>0 then do;
      ps{i}=n_per_grp+1;
      remainder=remainder-1;
    end;
    else ps{i}=n_per_grp;
  end;
  output;
run;

proc print data=n_per_group;
run;

* read in the single obs and keep it in the PDV until the end using an if _n_=1 then do block;
data out(drop=freq _count_ i p);
  if _n_=1 then do;
    set n_per_group;
    index=1;
  end;
  retain freq _count_ 0 index;
  array ps(&ngroup) ps1-ps&ngroup;
  set test end=last;
  * a little tricky: keep adding p together until the # of added obs = n_per_group as expected;
  * if the # of added obs = n_per_group, calculate the stats we want; otherwise keep adding;
  if _count_=ps(index) then do;
    num_obs=ps(index);
    avg_pred_p=sum_p/num_obs;
    lift=avg_pred_p/&overall_pct;
    output;
    index+1;
    _count_=0;
    sum_p=0;
  end;
  sum_p+p;
  _count_+1;
  if last then do;
    num_obs=ps(index);
    avg_pred_p=sum_p/num_obs;
    lift=avg_pred_p/&overall_pct;
    output;
  end;
run;

proc print data=out;
run;
```

**It is very easy to do this in R. A simple way:**

```
rm(list=ls())
x=c(.99,rep(.8,97),.2)
ngrp=10
# group sizes, as even as possible: 10 each for the first 9 groups, 9 in the last
nobs=rep(length(x)%/%ngrp, ngrp)+c(rep(1,length(x)%%ngrp), rep(0,ngrp-length(x)%%ngrp))
levl=rep(1:ngrp, nobs)
df=data.frame(cbind(x,levl))
aggregate(x~levl, df, mean)
```
aggregate(x~levl, df, mean)