Wednesday, October 5, 2011

Linear Regression with Categorical Predictors and Their Interactions

The data set we use is elemapi2. The variable mealcat is the percentage of free meals grouped into 3 categories (mealcat=1, 2, 3); collcat likewise has three categories (collcat=1, 2, 3). Both are categorical variables.


1: Regression with only one categorical predictor

Since mealcat is categorical, we convert it to dummy variables for linear regression. That is, mealcat1=1 if mealcat=1, else mealcat1=0; mealcat2=1 if mealcat=2, else mealcat2=0; mealcat3=1 if mealcat=3, else mealcat3=0.
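The dummies described above can be built in a short data step. This is only a sketch: it assumes the source data is available as sasreg.elemapi2 (the libref used later in this post), and it relies on SAS evaluating a comparison such as (mealcat = 1) to 1 when true and 0 when false.

```sas
/* Sketch: create 0/1 indicator variables for the three mealcat levels. */
/* Assumes the libref sasreg points at the folder holding elemapi2.     */
data elemdum;
    set sasreg.elemapi2;
    mealcat1 = (mealcat = 1);   /* 1 if mealcat=1, else 0 */
    mealcat2 = (mealcat = 2);   /* 1 if mealcat=2, else 0 */
    mealcat3 = (mealcat = 3);   /* 1 if mealcat=3, else 0 */
run;
```

Note that a missing mealcat would give 0 for all three dummies under this coding, so it is worth checking for missing values first.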
The regression of api00 on mealcat2 and mealcat3 (with mealcat1 as the reference) is:
proc reg data=elemdum;
       model api00 = mealcat2  mealcat3;
       plot predicted.*mealcat;
run;
The output is:

The intercept is 805.72. Since mealcat=1 is the reference group:
the mean of api00 at mealcat=1 is the intercept, 805.72;
the coefficient for mealcat2 (-166.32) is the difference between mealcat=2 and mealcat=1, so the mean of api00 at mealcat=2 is 805.72-166.32=639.40;
the coefficient for mealcat3 (-301.34) is the difference between mealcat=3 and mealcat=1, so the mean of api00 at mealcat=3 is 805.72-301.34=504.38.

The result can be verified by:
proc means data=elemdum;
       var api00;
       class mealcat;
run;

mealcat          1        2        3
mean of api00    805.72   639.40   504.38

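An ANOVA gives the same group means.

Note that proc glm automatically chooses the last category (here mealcat=3) as the reference, so its parameter estimates are differences from mealcat=3 rather than from mealcat=1. The group means they imply are the same as above.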
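A minimal sketch of the one-way ANOVA in proc glm, assuming the same elemdum data set. The class statement lets SAS build the dummies itself, and the solution option prints the parameter estimates.

```sas
/* Sketch: one-way ANOVA on mealcat; proc glm uses the last level */
/* (mealcat=3) as the reference by default.                       */
proc glm data=elemdum;
    class mealcat;
    model api00 = mealcat / solution;
run;
quit;
```

Given the group means above, the intercept here should be the mealcat=3 mean (504.38), the estimate for mealcat=1 should be about 805.72-504.38=301.34, and the estimate for mealcat=2 about 639.40-504.38=135.02.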


2: Regression with two Categorical Variables

1)      Here we consider two categorical variables, mealcat and collcat. The regression is:

proc reg data=elemdum;
       model api00 = mealcat1  mealcat2 collcat1 collcat2 ;
run;
quit;



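Number the nine combinations of mealcat and collcat as cells 1 to 9, laid out as follows:

             mealcat=1   mealcat=2   mealcat=3
collcat=1    cell 1      cell 2      cell 3
collcat=2    cell 4      cell 5      cell 6
collcat=3    cell 7      cell 8      cell 9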

With respect to mealcat, mealcat=3 is the reference; with respect to collcat, collcat=3 is the reference. So cell 9 is the reference cell, and the intercept is the predicted value for this cell.

The coefficient for mealcat2 is the difference between cell 8 and cell 9; it is also the difference between cell 5 and cell 6, and the difference between cell 2 and cell 3. The coefficient for mealcat1 can be explained in the same way.
The coefficient for collcat2 is the difference between cell 6 and cell 9; it is also the difference between cell 5 and cell 8, and the difference between cell 4 and cell 7. The coefficient for collcat1 can be explained in the same way.

But if you calculate the observed mean of each cell, it will differ from the predicted value here. Why? Because this model contains only main effects, which forces the between-cell differences listed above to be equal across the rows and columns of the table.
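One way to see the gap between observed cell means and main-effects predictions is to save the predicted values and summarize both by cell. This is a sketch; the output data set name pred_main and the variable name yhat are made up for illustration.

```sas
/* Sketch: save predictions from the main-effects model ...        */
proc reg data=elemdum;
    model api00 = mealcat1 mealcat2 collcat1 collcat2;
    output out=pred_main p=yhat;   /* yhat = predicted api00 */
run;
quit;

/* ... then compare observed and predicted means within each cell. */
proc means data=pred_main mean;
    class mealcat collcat;
    var api00 yhat;
run;
```

Within a cell every observation has the same yhat, so the mean of yhat is simply the model's prediction for that cell, to be set against the mean of api00.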

2)      Next we can use anova to get the same result:

proc glm data=elemdum;
       class mealcat collcat;
       model api00 = mealcat collcat / solution ss3;
run;


The result is the same as the one obtained with dummy variables.


3: Regression with Interactions of Categorical Variables

1)      As shown above, the main-effects model assumes that certain differences between cells are equal. To remove this restriction, we now consider the interaction of the two categorical variables.
First we need to create the interaction dummies, that is, the products of the mealcat dummies and the collcat dummies. The SAS code is:
data elemdum;
       set sasreg.elemapi2 (keep=api00 some_col mealcat collcat);
       array mealdum(3) mealcat1-mealcat3;
       array colldum(3) collcat1-collcat3;
       array mxc(3,3) m1xc1-m1xc3 m2xc1-m2xc3 m3xc1-m3xc3 ;
       do i=1 to 3;
              do j=1 to 3;
                     mealdum(i)=(mealcat=i);
                     colldum(j)=(collcat=j);
                     mxc(i,j)=mealdum(i)*colldum(j);
              end;
       end;
       drop i j;
run;

We can check the indicators by proc freq in SAS:
proc freq data=elemdum;
       table mealcat*collcat * m1xc1*m1xc2*m1xc3*m2xc1*m2xc2*m2xc3*m3xc1*m3xc2*m3xc3 / nopercent nocum list;
run;

The option list displays two-way to n-way tables in a list format rather than as crosstabulation tables.

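Now we add the dummy variables and interactions to the regression. All of the dummies are listed in the model, so it is not of full rank; proc reg sets the coefficients of the redundant terms to zero, which leaves mealcat=3 and collcat=3 as the reference levels.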
proc reg data=elemdum;
       model api00 = mealcat1 mealcat2 mealcat3 collcat1 collcat2 collcat3 m1xc1--m3xc3;
run;
The output is:

From the output we can see:
a: some variables have no coefficient estimate because they correspond to the reference levels;
b: the interactions m2xc1 and m2xc2 are not significant.

An important question is what the interaction means. In the model without interactions, the coefficient Bmealcat1 is the difference between group mealcat=1 and the reference group mealcat=3. With interactions it is more complicated: the presence of an interaction implies that the difference between mealcat=1 and mealcat=3 depends on the level of collcat, and Bmealcat1 is now that difference only at the reference level collcat=3. The interaction coefficient Bm1xc1 measures how much the difference between mealcat=1 and mealcat=3 changes when collcat=1 compared to collcat=3. Writing c1 to c9 for the cell values (c1 is mealcat=1, collcat=1; c3 is mealcat=3, collcat=1; c7 is mealcat=1, collcat=3; c9 is the reference cell), Bm1xc1=(c1-c3)-(c7-c9), which is algebraically the same as (c1-c7)-(c3-c9).
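Because the model with all interactions fits every cell mean exactly, this difference of differences can also be computed straight from the observed cell means. The query below is a sketch; the column name did_m1xc1 is made up for illustration.

```sas
/* Sketch: (c1 - c7) - (c3 - c9) computed from observed cell means.   */
/* A CASE without ELSE returns missing, which mean() then ignores.    */
proc sql;
    select (mean(case when mealcat=1 and collcat=1 then api00 end)
          - mean(case when mealcat=1 and collcat=3 then api00 end))
         - (mean(case when mealcat=3 and collcat=1 then api00 end)
          - mean(case when mealcat=3 and collcat=3 then api00 end))
           as did_m1xc1
    from elemdum;
quit;
```

The value returned should match the estimate of Bm1xc1 in the regression output, up to rounding.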

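In more detail, the predicted value in each cell is made up of the following terms (coefficients for the reference levels mealcat=3 and collcat=3 are zero and are omitted):

mealcat=1, collcat=1: Intercept + Bmealcat1 + Bcollcat1 + Bm1xc1
mealcat=2, collcat=1: Intercept + Bmealcat2 + Bcollcat1 + Bm2xc1
mealcat=3, collcat=1: Intercept + Bcollcat1
mealcat=1, collcat=2: Intercept + Bmealcat1 + Bcollcat2 + Bm1xc2
mealcat=2, collcat=2: Intercept + Bmealcat2 + Bcollcat2 + Bm2xc2
mealcat=3, collcat=2: Intercept + Bcollcat2
mealcat=1, collcat=3: Intercept + Bmealcat1
mealcat=2, collcat=3: Intercept + Bmealcat2
mealcat=3, collcat=3: Intercept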

So, if you want to test the simple effect of collcat=1 versus collcat=3 when mealcat=1, you need to compare the two cells: (Intercept + Bmealcat1 + Bcollcat1 + Bm1xc1) - (Intercept + Bmealcat1) = Bcollcat1 + Bm1xc1. The test is therefore whether Bcollcat1 + Bm1xc1 is zero.
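A sketch of how this hypothesis could be tested with the test statement of proc reg. To keep the example unambiguous, the model below lists only the non-reference dummies and interactions (a full-rank version of the model above); the label c1_vs_c3_at_m1 is made up for illustration.

```sas
/* Sketch: full-rank interaction model with mealcat=3 and collcat=3 */
/* as reference levels, plus a test of Bcollcat1 + Bm1xc1 = 0.      */
proc reg data=elemdum;
    model api00 = mealcat1 mealcat2 collcat1 collcat2
                  m1xc1 m1xc2 m2xc1 m2xc2;
    c1_vs_c3_at_m1: test collcat1 + m1xc1 = 0;
run;
quit;
```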

The test is significant, indicating that api00 differs between collcat=1 and collcat=3 within the mealcat=1 group.

2)      The same result can be given by anova as below:
proc glm data = elemdum;
       class mealcat collcat;
       model api00 = mealcat collcat mealcat*collcat / solution;
       lsmeans mealcat*collcat / slice=collcat;
run;
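The specific comparison of collcat=1 with collcat=3 inside mealcat=1 can also be requested in proc glm with an estimate statement. This is a sketch: it assumes the interaction coefficients are ordered with collcat varying fastest within mealcat (m1c1, m1c2, m1c3, m2c1, ...), which follows from the order of the class statement here.

```sas
/* Sketch: collcat=1 minus collcat=3 within mealcat=1.               */
/* Cell mean = mu + mealcat_i + collcat_j + (mealcat*collcat)_ij, so */
/* the mealcat terms cancel and only collcat and interaction remain. */
proc glm data=elemdum;
    class mealcat collcat;
    model api00 = mealcat collcat mealcat*collcat / solution;
    estimate 'collcat 1 vs 3 at mealcat=1'
             collcat 1 0 -1
             mealcat*collcat 1 0 -1 0 0 0 0 0 0;
run;
quit;
```

The estimate should equal Bcollcat1 + Bm1xc1 from the dummy-variable regression.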


Finally, we can verify the result by calculating the mean of api00 in each cell, for example with proc tabulate.
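A proc tabulate call along the following lines would produce the table of cell means. It is a sketch, not necessarily the exact call used for this post.

```sas
/* Sketch: mean of api00 in each cell, with collcat as rows */
/* and mealcat as columns.                                  */
proc tabulate data=elemdum;
    class mealcat collcat;
    var api00;
    table collcat, mealcat*api00*mean;
run;
```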

Or we can get it by proc sql from SAS:
proc sql;
       create table temp as
       select mean(api00) as avg, collcat, mealcat
       from elemdum
       group by mealcat, collcat
       order by mealcat, collcat;
quit;
