Tuesday, March 20, 2012

proc format: format characters to numbers

learn proc format:




data test;
input start $ end $ label;
cards;
nextag nextag 1
#sina #sina 2
#1000 #1000 3
;
run;

data fmt0;
retain fmtname'$testfmt' ;
set test end=last;
start=start;
end=end;
label=label;
output;
if last then do;
hlo='O';
label=0;
output;
end;
run;

proc print data=fmt0;
run;

proc format cntlin=fmt0;
select $testfmt;
run;

data test;
input test $20.;
cards;
nextag
#sina
&ok
#1000
wokong
;
run;

data test_result;
set test;
id=put(test, $testfmt.)+0;
run;

proc print data=test_result;
run;



The output is:



Obs test id

1 nextag 1
2 #sina 2
3 &ok 0
4 #1000 3
5 wokong 0


index, rename, keep

A new data set test1 is read from test(two variables: id, a). We want to create an index for test1(index name is called aid since aid is indexed in data set aid). Pay attention how index, rename and keep is used in data step test1.



data test;
input id a;
cards;
1 1
3 3
2 5
;
run;

data aid (index=(aid));
input aid b;
cards;
1 1
2 3
3 4
;
run;

data test1(index=(aid));
set test(keep=id a );
rename id = aid;
run;

proc contents data=test1;
run;

data merge_1;
merge aid test1;
by aid;
run;

proc print data=merge_1;
run;


Thursday, March 15, 2012

use PROC FORMAT to rank

The question is: after calculate the percentage, we want to count from how many obs are in 0-1%, how many in 1%-2%, ..., how many in 99%-100%. For this simple example, we can use round or floor or ceil to get the result. But here shows how to use proc format to get it.

After we get cumulative percentage, we format the percentage with put function. Take care in the proc format don't forget to add hlo, otherwise it will gives error since the first start number is missing.



data test;
do i=1 to 1000;
x=ranpoi(8,8);
output;
end;
run;

data test;
set test;
y+x;
run;

proc sql;
create table test as
select x, y, y/max(y) as pct
from test;
quit;

data test2;
do i=1 to 100;
x=i/100;
y=lag(x);
output;
end;
run;

data a;
set test2;
fmtname='fmttestf';
start=y;
end=x;
label=i;
eexcl='Y';
if _n_=1 then hlo='L';
run;

proc format cntlin=a;
select fmttestf;
run;

data final;
set test;
rank=put(pct, fmttestf.)+0;
run;

proc print data=final;
run;

proc sql;
select rank, count(1) as cnt
from final
group by rank
order by rank;
quit;

Monday, March 12, 2012

sample from a data set with sample data sets are similar to each other

Purpose: sample 5 sub data set (each has 1000 obs) from a data set (test, which has 100000 obs). The purpose is to make sure in each sample, the mean of variable x is similar to the other samples. Here in the example set the difference of mean of x is less than .05.




data test0;
 do i=1 to 100000;
   x=ranuni(1);
   output;
 end;
run;

%let n_dataset=5;
%let m_size=1000;

%macro m_sample;
 %do j=1 %to 15;

   data test;
     set test0;
   run;

   %do i=1 %to &n_dataset;

     proc sort data=test;
       by i;
     run;

     proc surveyselect data=test %if &i>1 %then %do; (where=(group<1)) %end; out=sample_&i method=sys sampsize=1000;
     run;
     proc sort data=sample_&i;
       by i;
     run;
     data test;
       merge test sample_&i(in=in1);
       by i;
       if in1=1 then group=&i;
     run;

   %end;

   data test;
     set test;
     if group=. then group=6;
   run;

   proc summary data=test nomissing;
     class group;
     var x;
     output out=out_sum mean(x) = total;
   run;

   proc print data=out_sum width=min;
   run;

   data out_sum;
     set out_sum;
     id=1;
   run;

   data summary;
     merge out_sum(where=(_type_=1)) out_sum(where=(_type_=0) rename=(total=all_total));
     by id;
     pct_diff=abs(total/all_total-1);
     if pct_diff<.05 then flag=1;
   run;

   proc print data=summary width=min;
   run;

   proc sql;
     select sum(flag) into :flag from summary;
   quit;

   %if &flag=6 %then %do;
     endsas;
   %end;

 %end;

%mend;

%m_sample;


use SAS to read data from website: use filename url

a sample sas code:



filename testdata url "http://dl.dropbox.com/u/10684315/sas_pub/sas_read_webdata.csv";

data test;
  infile testdata dsd firstobs=2;
  length CANONICAL_KEYWORD $400.;
  input keyword_id CANONICAL_KEYWORD $ node_id rank;
run;

data test;
  set test;
  canonical_keyword=translate(canonical_keyword,' ','"');
  canonical_keyword=trim(left(canonical_keyword));
run;

proc contents data=test;
run;

proc print data=test width=min;
run;


An example code from SAS help:



filename foo url 
    'http://support.sas.com/techsup/service_intro.html';
       
data _null_;
   infile foo length=len;
   input record $varying200. len;
   put record $varying200. len;
   if _n_=15 then stop;
run; 

Thursday, March 8, 2012

split a big data set into several subsets

This questions comes from when I want to email a dataset, it's too large to email. So I need to split it into several subsets to email.

Usually for two variables data, 1 million obs will be about 10MB after gzip. This is a proper size to email.

Here shows an example how to split a 10000 records data into several subdata with each having about 2300 obs.


options mprint mlogic;

data test;
  do i=1 to 10000;
    x=rannor(1);
      output;
  end;
run;

%let n_obs=2300;

proc sql noprint;
  select count(1) into :m_total from test;
quit;

%macro sub_data;
    %let n_dataset=%sysfunc(ceil((&m_total/&n_obs)));
      %do i=1 %to &n_dataset;
        data sub_data_&i;
          set test;
        if (&i-1)*&n_obs+1<=_n_<=&i*&n_obs;
        run;
      %end;
%mend sub_data;

%sub_data;

how to sample several groups of data from a given data set, sampled data will not appear again?

e.g., data test has 10000 obs, we want to sample 3 data sets, called sample_1, sample_2 and sample_3. The obs in sample_1 will not appear in sample_2, the obs in sample 1, sample_2 will not appear in sample_3.

In the following we use an indicator called group. For sample_1, we srs from data test, and id its group=1. For sample_2, and so on, we restrict group<1 to exclude data in sample_1. And id sample_2 as group=2, and so on.


options mprint mlogic;

data test;
  do i=1 to 10000;
    x=rannor(1);
      output;
  end;
run;

%let n_sample=3;

%macro m_sample;
  %do i=1 %to &n_sample;
    proc surveyselect data=test %if &i>1 %then %do; (where=(group<1)) %end; method=srs sampsize=100 out=sample_&i;
    run;

      proc sort data=sample_&i;
        by i;
      run;
      proc sort data=test;
        by i;
      run;

      data test;
        merge test sample_&i(in=in2);
        by i;
        if in2 then group=&i;
      run;
  %end;
%mend m_sample;

%m_sample;

Wednesday, March 7, 2012

gzip several different files in linux

e.g., during sas running, there are several output and we want to send out these different output in one email.

1: use proc printto to print to lst or print to csv files; if for lst, remember to cat xx.lst >> xx.txt for easy to read.

2: tar different files together.   x "tar -cvf  to_be_together.tar  file1.csv file2.txt file3.log"; This step to tar file1, file2, file3 together into one file called to_be_together.tar

3: x "gzip -f to_be_together.tar"

that's it.