Friday, February 1, 2013

Empirical logit plot between x and binary y to check their linear relationship


** purpose: draw the empirical logit plot between x and binary y ;
** to check if there is linear relation or not                   ;

 

%macro empplot(indata, xvar, yvar);

/* split the input variable into 100 rank-based bins */
proc rank data=&indata groups=100 out=out;
        var &xvar;
        ranks bin;
run;

/* per bin: number of events (sum of y) and mean of x */
proc means data=out noprint nway;
        class bin;
        var &yvar &xvar;
        output out=bins sum(&yvar)=&yvar mean(&xvar)=&xvar;
run;

/* empirical logit per bin; the sqrt(_freq_)/2 term keeps it finite */
/* when a bin has no events or only events                          */
data bins;
        set bins;
        elogit=log((&yvar+(sqrt(_freq_)/2))/(_freq_-&yvar+(sqrt(_freq_)/2)));
run;

/* empirical logit against the bin mean of x */
proc sgplot data=bins;
        reg y=elogit x=&xvar /
                curvelabel="Linear Relationship?"
                curvelabelloc=outside
                lineattrs=(color=ligr);
        series y=elogit x=&xvar;
        title "Empirical Logit against &xvar";
run;

/* empirical logit against the bin number */
proc sgplot data=bins;
        reg y=elogit x=bin /
                curvelabel="Linear Relationship?"
                curvelabelloc=outside
                lineattrs=(color=ligr);
        series y=elogit x=bin;
        title "Empirical Logit against Binned &xvar";
run;

%mend empplot;
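
For reference, the quantity that the data step in the macro computes for each bin can be written out as below. The symbols are labels introduced here for illustration only: y_i is the number of events in bin i (the sum of &yvar) and n_i is the number of observations in that bin (_freq_).

```latex
\[
\operatorname{elogit}_i
  = \ln\!\left(
      \frac{y_i + \sqrt{n_i}/2}{n_i - y_i + \sqrt{n_i}/2}
    \right)
\]
```

Because sqrt(n_i)/2 is added to both the numerator and the denominator, the logit stays finite even when a bin contains no events or only events.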

 

 

An example: the earlier univariate screening showed a relatively strong association between recentview_count and action, so we draw their empirical logit plot as below:

%empplot(dyps.dyps_trainoversamp2, recentview_count, action);

 

         

 


Check the correlation between independent variables and a binary dependent variable with Spearman and Hoeffding's D statistics

1: How do we check the association between the independent variables (IVs) and a binary dependent variable (DV)? We use the Spearman correlation and Hoeffding's D statistic. Both are rank-based, so they are less sensitive to outliers and nonlinearities than the Pearson correlation. Spearman is the correlation between the ranks of the IV and the DV, so it measures monotonic association; Hoeffding's D can also detect non-monotonic association. A variable usually ranks similarly on both statistics. If it ranks well on Hoeffding's D but poorly on Spearman, the relationship is likely non-monotonic.

If both Spearman and Hoeffding give low correlations, we can drop the variable, as with price_change_pct, avg_rating, total_ratings and num_sellers in the example below.

The following macro tests a group of IVs against the DV:

*===============================================================================================;
** use proc corr to examine the association between the inputs and the target var               ;
** Spearman corr is a corr of the ranks of the input var with the binary target,                ;
** it's less sensitive to nonlinearities and outliers than Pearson stats                        ;
** Hoeffding's D statistic is also used to check the association                                ;
** if spearman rank is high and hoeffding rank is low, then the association is not monotonic    ;
*===============================================================================================;
** The output is a macro var containing the selected variables by univariate screening          ;
*===============================================================================================;


%macro uniscreen(indata, varfile, target, pvalue);

filename varfile "&varfile";
** filename varfile "/home/hsong/varlist.txt";
data varall;
        infile varfile delimiter=',';
        length varname $1000.;
        input varname $ @@;
run;

proc print data=varall width=min;
title "print of varall";
run;

proc sql;
        select varname into: inputs separated by ' ' from varall;
        select count(*) into: nobs from varall;
quit;

%let nvar=%sysfunc(compress(&nobs));

ods html close;
ods output spearmancorr=spearman hoeffdingcorr=hoeffding;
proc corr data=&indata spearman hoeffding rank;
        var &inputs;
        with &target;
run;
ods html;

data spearman1(keep=variable scorr spvalue ranksp);
        length variable $ 80.;
        set spearman;
        array best(*) best1--best&nvar;
        array r(*) r1--r&nvar;
        array p(*) p1--p&nvar;
        do i=1 to dim(best);
                variable=best(i);
                scorr=r(i);
                spvalue=p(i);
                ranksp=i;
                output;
        end;
run;

data hoeffding1(keep=variable hcorr hpvalue rankho);
        length variable $ 80.;
        set hoeffding;
        array best(*) best1--best&nvar;
        array r(*) r1--r&nvar;
        array p(*) p1--p&nvar;
        do i=1 to dim(best);
                variable=best(i);
                hcorr=r(i);
                hpvalue=p(i);
                rankho=i;
                output;
        end;
run;

proc sort data=spearman1;
        by variable;
run;

proc sort data=hoeffding1;
        by variable;
run;

data correlations;
        merge spearman1 hoeffding1;
        by variable;
run;

proc sort data=correlations;
        by ranksp;
run;

proc print data=correlations width=min;
title "print of correlations";
run;

proc sql noprint;
        /* lowest Spearman rank whose p-value exceeds the cutoff */
        select min(ranksp) into: vref
        from correlations
        where spvalue > &pvalue;

        /* lowest Hoeffding rank whose p-value exceeds the cutoff */
        select min(rankho) into: href
        from correlations
        where hpvalue > &pvalue;
quit;

proc sgplot data=correlations;
        refline &vref / axis=y;
        refline &href / axis=x;
        scatter y=ranksp x=rankho / datalabel=variable;
        yaxis label="Rank of Spearman";
        xaxis label="Rank of Hoeffding";
        title "Scatter Plot of the Ranks of Spearman vs Hoeffding";
run;                                                                                            

proc sql;
        /* drop variables that rank beyond both reference lines */
        delete from correlations
        where ranksp>&vref and rankho>&href;
quit;

%global screened;

proc sql;
        select trim(left(variable)) into: screened separated by ' ' from correlations;
quit;

%put &screened;

%mend;

libname dyps "/data02/temp/temp_hsong/product_banner";


%uniscreen(dyps.dyps_trainoversamp2, ./varfile.txt, action, .2);

There are four parameters: the first is the data set name, the second is the comma-delimited text file listing all IVs, the third is the DV, and the last is the p-value cutoff used to discard non-significant variables. To avoid dropping too many variables, this cutoff is usually set to .5; in this example we set it to .2.

The output is: 1) a scatter plot of each variable's Spearman rank against its Hoeffding rank; 2) the macro variable screened, which contains the selected variables.


The plot above suggests that we can drop price_change_pct, avg_rating, total_ratings and num_sellers. Let's look at these data: over 90% of the price_change_pct values are 0, and over 75% of the avg_rating and total_ratings values are 0.

[Figures: price_change_pct, avg_rating, total_ratings]

After this variable screening, the following variables pass to the next step: recentview_count, ps_lowest_price, days_after_ps_last_view, keyword_count, impressions, rank, pool_size, lowest_price, ref_rank, days_after_ref, days_after_last_view and num_sellers.

Next we will check the linear relationship between each of these IVs and the DV, one by one. (Question: how do we check the linear relationship between an IV and a binary DV?)
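
One possible way to do this is to draw an empirical logit plot for every screened variable. The sketch below assumes that the %empplot macro from the empirical logit post is already compiled in the session and that %uniscreen has filled the global macro variable screened. The wrapper name empplot_all and its parameter names are hypothetical and introduced only for this illustration.

```sas
/* Hypothetical wrapper: call %empplot once per variable in a    */
/* space-separated list. Assumes %empplot is already compiled.   */
%macro empplot_all(indata, varlist, yvar);
        %local i xvar;
        %let i = 1;
        %let xvar = %scan(&varlist, &i, %str( ));
        %do %while (%length(&xvar) > 0);
                %empplot(&indata, &xvar, &yvar);
                %let i = %eval(&i + 1);
                %let xvar = %scan(&varlist, &i, %str( ));
        %end;
%mend empplot_all;

/* Assumes %uniscreen has already created the macro variable screened */
%empplot_all(dyps.dyps_trainoversamp2, &screened, action);
```

Each call overwrites the work data sets out and bins that %empplot creates, so the plots should be reviewed as they are produced.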