If both Spearman and Hoeffding give low correlations, we can drop those variables, like price_change_pct, avg_rating, total_ratings and num_sellers in the example below.

The following macro tests a group of IVs against the DV:

*===============================================================================================;

** use proc corr to examine the association between the inputs and the target var ;

** Spearman corr is a corr of the ranks of the input var with the binary target, ;

** it is less sensitive to nonlinearities and outliers than Pearson stats ;

** the Hoeffding D statistic is also used to check the association ;

** if the spearman rank is high (weak) and the hoeffding rank is low (strong), then the association is not monotonic ;

*===============================================================================================;

** The output is a macro var containing the selected variables by univariate screening ;

*===============================================================================================;

%macro uniscreen(indata, varfile, target, pvalue);

filename varfile "&varfile";

** filename varfile "/home/hsong/varlist.txt";

data varall;

infile varfile delimiter=',';

length varname $1000.;

input varname $ @@;

run;

proc print data=varall width=min;

title "print of varall";

run;

proc sql;

select varname into: inputs separated by ' ' from varall;

select count(*) into: nobs from varall;

quit;

%let nvar=%sysfunc(compress(&nobs));

ods html close;

ods output spearmancorr=spearman hoeffdingcorr=hoeffding;

proc corr data=&indata spearman hoeffding rank;

var &inputs;

with &target;

run;

ods html;

data spearman1(keep=variable scorr spvalue ranksp);

length variable $ 80.;

set spearman;

array best(*) best1--best&nvar;

array r(*) r1--r&nvar;

array p(*) p1--p&nvar;

do i=1 to dim(best);

variable=best(i);

scorr=r(i);

spvalue=p(i);

ranksp=i;

output;

end;

run;

data hoeffding1(keep=variable hcorr hpvalue rankho);

length variable $ 80.;

set hoeffding;

array best(*) best1--best&nvar;

array r(*) r1--r&nvar;

array p(*) p1--p&nvar;

do i=1 to dim(best);

variable=best(i);

hcorr=r(i);

hpvalue=p(i);

rankho=i;

output;

end;

run;

proc sort data=spearman1;

by variable;

run;

proc sort data=hoeffding1;

by variable;

run;

data correlations;

merge spearman1 hoeffding1;

by variable;

run;

proc sort data=correlations;

by ranksp;

run;

proc print data=correlations width=min;

title "print of correlations";

run;

proc sql noprint;

select min(ranksp) into: vref

from (select ranksp

from correlations

having spvalue > &pvalue);

select min(rankho) into: href

from (select rankho

from correlations

having hpvalue > &pvalue);

quit;

proc sgplot data=correlations;

refline &vref / axis=y;

refline &href / axis=x;

scatter y=ranksp x=rankho / datalabel=variable;

yaxis label="Rank of Spearman";

xaxis label="Rank of Hoeffding";

title "Scatter Plot of the Ranks of Spearman vs Hoeffding";

run;

proc sql;

delete from correlations

where ranksp>&vref and rankho>&href;

quit;

%global screened;

proc sql;

select trim(left(variable)) into: screened separated by ' ' from correlations;

quit;

%put &screened;

%mend;

libname dyps "/data02/temp/temp_hsong/product_banner";

%uniscreen(dyps.dyps_trainoversamp2, ./varfile.txt, action, .2);

The macro takes four parameters: the first is the data set name, the second is the text file listing all IVs, the third is the DV, and the last is the p-value cutoff used to discard non-significant variables. To avoid dropping too many variables, this cutoff is usually set to .5; in this example we set it to .2.
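The macro reads the variable list with a comma delimiter, so the file can simply be one comma-separated line of variable names. A minimal sketch of writing such a file, assuming the same ./varfile.txt path as the call above; the four names are just a subset of the IVs in this example, picked for illustration:

```sas
** write a comma-delimited list of IV names for the macro to read ;
data _null_;
  file "./varfile.txt";
  put "recentview_count,ps_lowest_price,keyword_count,impressions";
run;
```

The list should not include the DV itself, since the macro correlates every listed variable with the target.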

The output is: 1) a scatter plot of the Spearman ranks against the Hoeffding ranks, and 2) the macro variable screened, which contains the selected variables.
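Because screened is declared with %global, it is still available after the macro finishes. A minimal sketch of using it, assuming the macro call above has already run; PROC MEANS is only an illustration here and is not part of the original workflow:

```sas
** summarize only the variables that survived the screening ;
proc means data=dyps.dyps_trainoversamp2 n nmiss mean min max;
  var &screened;
run;
```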

The plot above shows that we can drop price_change_pct, avg_rating, total_ratings and num_sellers. Let's have a look at these data (over 90% of the price_change_pct values are 0, and over 75% of the avg_rating and total_ratings values are 0):

(Figures: price_change_pct, avg_rating, total_ratings)
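The share of zeros can be checked directly. A minimal sketch, assuming the same training data set as in the macro call; it relies on PROC SQL evaluating a comparison to 1 or 0, so the mean of the comparison is the fraction of rows equal to 0 (rows with missing values count as not zero):

```sas
** fraction of rows where each variable equals 0 ;
proc sql;
  select mean(price_change_pct = 0) as zero_price_change format=percent8.1,
         mean(avg_rating = 0)       as zero_avg_rating   format=percent8.1,
         mean(total_ratings = 0)    as zero_total_ratings format=percent8.1
  from dyps.dyps_trainoversamp2;
quit;
```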

After this variable screening, the following variables pass to the next step: recentview_count, ps_lowest_price, days_after_ps_last_view, keyword_count, impressions, rank, pool_size, lowest_price, ref_rank, days_after_ref, days_after_last_view, num_sellers.

Next we will check the linear relation between each of those IVs and the DV, one by one. (Question: how do we check the linear relation between an IV and a binary DV?)
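One common way to approach this question is an empirical logit plot: bin the IV, compute the log odds of the DV within each bin, and see whether the points fall roughly on a straight line. The following is only a sketch under assumptions, not necessarily the method used in the next step: it assumes action is coded 0/1, uses 10 bins, and adds 0.5 to each count so the logit stays finite. The names iv, binned, bins, events, iv_mean and elogit are made up for this illustration.

```sas
** pick one screened IV to examine ;
%let iv = recentview_count;

** split the IV into 10 rank-based bins (fewer if there are many ties) ;
proc rank data=dyps.dyps_trainoversamp2 groups=10 out=binned;
  var &iv;
  ranks bin;
run;

** count events and average the IV within each bin ;
proc means data=binned noprint nway;
  class bin;
  var action &iv;
  output out=bins sum(action)=events mean(&iv)=iv_mean;
run;

** empirical logit per bin, where _freq_ is the number of rows in the bin ;
data bins;
  set bins;
  elogit = log((events + 0.5) / (_freq_ - events + 0.5));
run;

proc sgplot data=bins;
  series y=elogit x=iv_mean / markers;
  yaxis label="Empirical logit";
  xaxis label="Bin mean of &iv";
run;
```

A roughly straight line suggests the IV can enter a logistic model as is; a clear curve suggests a transformation or binning may be needed.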