If both Spearman and Hoeffding give low correlations for a variable, we can drop it, as with price_change_pct, avg_rating, total_ratings and num_sellers in the example below.
The following macro screens a group of IVs against the DV:
*===============================================================================================;
** Use PROC CORR to examine the association between the inputs and the target variable ;
** The Spearman correlation is the correlation of the ranks of an input with the binary target, ;
** and it is less sensitive to nonlinearities and outliers than the Pearson statistic ;
** The Hoeffding D statistic is also used to check the association ;
** A high Spearman rank (weak) with a low Hoeffding rank (strong) means the association is not monotonic ;
*===============================================================================================;
** The output is a macro variable, screened, holding the variables kept by univariate screening ;
*===============================================================================================;
%macro uniscreen(indata, varfile, target, pvalue);
filename varfile "&varfile";
** filename varfile "/home/hsong/varlist.txt";
data varall;
infile varfile delimiter=',';
length varname $1000.;
input varname $ @@;
run;
proc print data=varall width=min;
title "print of varall";
run;
proc sql;
select varname into: inputs separated by ' ' from varall;
select count(*) into: nobs from varall;
quit;
%let nvar=%sysfunc(compress(&nobs));
ods html close;
ods output spearmancorr=spearman hoeffdingcorr=hoeffding;
proc corr data=&indata spearman hoeffding rank;
var &inputs;
with &target;
run;
ods html;
data spearman1(keep=variable scorr spvalue ranksp);
length variable $ 80.;
set spearman;
array best(*) best1--best&nvar;
array r(*) r1--r&nvar;
array p(*) p1--p&nvar;
do i=1 to dim(best);
variable=best(i);
scorr=r(i);
spvalue=p(i);
ranksp=i;
output;
end;
run;
data hoeffding1(keep=variable hcorr hpvalue rankho);
length variable $ 80.;
set hoeffding;
array best(*) best1--best&nvar;
array r(*) r1--r&nvar;
array p(*) p1--p&nvar;
do i=1 to dim(best);
variable=best(i);
hcorr=r(i);
hpvalue=p(i);
rankho=i;
output;
end;
run;
proc sort data=spearman1;
by variable;
run;
proc sort data=hoeffding1;
by variable;
run;
data correlations;
merge spearman1 hoeffding1;
by variable;
run;
proc sort data=correlations;
by ranksp;
run;
proc print data=correlations width=min;
title "print of correlations";
run;
proc sql noprint;
/* rank of the first input whose Spearman p-value exceeds the cutoff */
select min(ranksp) into: vref
from correlations
where spvalue > &pvalue;
/* rank of the first input whose Hoeffding p-value exceeds the cutoff */
select min(rankho) into: href
from correlations
where hpvalue > &pvalue;
quit;
proc sgplot data=correlations;
refline &vref / axis=y;
refline &href / axis=x;
scatter y=ranksp x=rankho / datalabel=variable;
yaxis label="Rank of Spearman";
xaxis label="Rank of Hoeffding";
title "Scatter Plot of the Ranks of Spearman vs Hoeffding";
run;
proc sql;
delete from correlations
where ranksp>&vref and rankho>&href;
quit;
%global screened;
proc sql;
select trim(left(variable)) into: screened separated by ' ' from correlations;
quit;
%put &screened;
%mend;
libname dyps "/data02/temp/temp_hsong/product_banner";
%uniscreen(dyps.dyps_trainoversamp2, ./varfile.txt, action, .2);
The macro takes four parameters: the first is the data set name, the second is the comma-delimited text file listing all IVs, the third is the DV, and the last is the p-value cutoff used to discard non-significant variables. To avoid dropping too many variables, this cutoff is usually set to .5; in this example we use .2.
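As a rough illustration of the second parameter, the macro reads the file with delimiter=',' and a trailing @@, so the variable names can sit on one line separated by commas. The sketch below writes such a file; the three names are just a subset taken from this example, and the path is the one used in the call above.

```sas
/* Sketch only: write a comma-delimited list of input names for %uniscreen. */
/* No blanks after the commas, because the macro treats only ',' as a delimiter. */
data _null_;
   file "./varfile.txt";
   put "recentview_count,ps_lowest_price,keyword_count";
run;
```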
The output is: 1) a scatter plot of the Spearman ranks against the Hoeffding ranks; 2) the macro variable screened, which contains the selected variables.
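Because screened is declared %global, it can be used after the macro call wherever a variable list is expected. A minimal sketch, assuming %uniscreen has already run and that action is coded 0/1 with 1 as the event (the coding is an assumption, not something stated above):

```sas
/* Sketch only: pass the screened inputs to a later modeling step. */
%put Screened inputs: &screened;

proc logistic data=dyps.dyps_trainoversamp2;
   /* event='1' assumes the target is coded 0/1 */
   model action(event='1') = &screened;
run;
```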
The plot above shows that we can drop price_change_pct, avg_rating, total_ratings and num_sellers. Let's have a look at the data (over 90% of price_change_pct values are 0, and over 75% of avg_rating and total_ratings values are 0):
[Output shown for: price_change_pct, avg_rating, total_ratings]
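One way to check these proportions is to look at the percentiles of each variable: if the 90th percentile of price_change_pct is 0, at least 90% of its values are 0 or below. A minimal sketch, assuming the same training data set as in the macro call:

```sas
/* Sketch only: percentiles of the candidate variables to drop. */
proc means data=dyps.dyps_trainoversamp2 n nmiss min p25 median p75 p90 max;
   var price_change_pct avg_rating total_ratings;
run;
```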
After this variable screening, the following variables pass to the next step: recentview_count, ps_lowest_price, days_after_ps_last_view, keyword_count, impressions, rank, pool_size, lowest_price, ref_rank, days_after_ref, days_after_last_view and num_sellers.
Next we will check the linear relation between each of these IVs and the DV, one by one. (Question: how do we check for a linear relation between an IV and a binary DV?)
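One common approach to this question is an empirical logit plot: bin the input, compute the log odds of the target in each bin, and see whether the points fall roughly on a straight line. The sketch below is only an illustration, not necessarily the method used in the next step. It assumes action is numeric 0/1; the choice of 20 bins, the 0.5 offset, and the names binned, bins, elogit and mean_x are all hypothetical.

```sas
/* Sketch only: empirical logit plot for one screened input. */
%let var = recentview_count;   /* any single input from &screened */

/* Split the input into up to 20 rank-based bins */
proc rank data=dyps.dyps_trainoversamp2 groups=20 out=binned;
   var &var;
   ranks bin;
run;

/* Per bin: number of events and mean of the input; _FREQ_ is the bin size */
proc means data=binned noprint nway;
   class bin;
   var action &var;
   output out=bins sum(action)=events mean(&var)=mean_x;
run;

data bins;
   set bins;
   /* the 0.5 offset avoids log(0) when a bin has no events or no non-events */
   elogit = log((events + 0.5) / (_freq_ - events + 0.5));
run;

proc sgplot data=bins;
   series y=elogit x=mean_x / markers;
   yaxis label="Empirical logit";
   xaxis label="Bin mean of &var";
run;
```

A roughly straight line suggests the input can enter a logistic model as is; a clear curve suggests a transformation or binning.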