Friday, October 28, 2011

Convert Categorical Variables to Dummy Variables

最简单的就是用if一个个的写;

其次是用array+do循环来做,这个categorical variable的值比较好才好,最好是1,2,3,4,5这种;别的值不是这么方便,因为不好放进do循环里面;用macro %qscan可以,但是麻烦;

第三个就是上次贴的一个macro,也是麻烦,而且做interaction不方便。

今天面试的时候被教了一招,用proc glmmod;举例如下:


options formdlim='$';
data name;
      input name $ sex$;
        x=1;
      cards;
      zhao f
      qian f
      sun  m
      li   m
      zou  m
      wu   m
      zeng f
      wang m
      zhao f
      wang f
      sun  m
      wu   f
      ;
run;

proc glmmod data=name outdesign=design outparm=parm;
      class name sex;
      model x = name sex name*sex;
run;
quit;

这时SAS产生两个data set,分别是design和parm;design中的变量(列)名都是以col开头的,这样方便于在proc reg或者别的地方调用;那么每个column到底指的什么意思,或者说对应的categorical variable的什么值?parm就是来说明这个问题的。parm指明了design中每个column的含义;

比如下面两表,parm说明:design中1指intercept,2-9指name,10-11是sex,后面的是interaction。这就清晰明了了。

PARM:

Design:

Wednesday, October 26, 2011

Analysis of variance of Intercept estimates and Slope estimates (proc glm, proc reg, proc genmod)

The background is from An Introduction of Generalized Linear Model by A.J.Dobson; It's described as below:

The data are from an experiment to promote the recovery of stroke patients. There were three experimental groups:

A was a new occupational therapy intervention;
B was the existing stroke rehabilitation program conducted in the same hospital where A was conducted;
C was the usual care regime for stroke patients provided in a different hospital.

There were eight patients in each experimental group. The response variable was a measure off unctional ability, the Bartel index; higher scores correspond to better outcomes and the maximum score is 100. Each patient was assessed weekly over the eight weeks of the study.

data table11_1;
  input Subject    Group $     week1-week8;
datalines;
1     A     45    45    45    45    80    80    80    90
2     A     20    25    25    25    30    35    30    50
3     A     50    50    55    70    70    75    90    90
4     A     25    25    35    40    60    60    70    80
5     A     100   100   100   100   100   100   100   100
6     A     20    20    30    50    50    60    85    95
7     A     30    35    35    40    50    60    75    85
8     A     30    35    45    50    55    65    65    70
9     B     40    55    60    70    80    85    90    90
10    B     65    65    70    70    80    80    80    80
11    B     30    30    40    45    65    85    85    85
12    B     25    35    35    35    40    45    45    45
13    B     45    45    80    80    80    80    80    80
14    B     15    15    10    10    10    20    20    20
15    B     35    35    35    45    45    45    50    50
16    B     40    40    40    55    55    55    60    65
17    C     20    20    30    30    30    30    30    30
18    C     35    35    35    40    40    40    40    40
19    C     35    35    35    40    40    40    45    45
20    C     45    65    65    65    80    85    95    100
21    C     45    65    70    90    90    95    95    100
22    C     25    30    30    35    40    40    40    40
23    C     25    25    30    30    30    30    35    40
24    C     15    35    35    35    40    50    65    65
;
run;

Here a linear model is used to study the relation of Group and Time. The model here used is:
                                                E(Y_{i,j,k}) = \alpha_{i} + \beta_{i} * t_{k} + e_{i,j,k}
Here Y_{i,j,k} is the score at time t_{k} ( k=1 to 8) for patient j (j=1 to 8) in group i (i=1 for Group A, 2 for B and 3 for C);

For linear regression convenience, we need to rearrange the data from wide to long:

data table11_1long;
      set table11_1;
      array week(8) week1-week8;
      do time=1 to 8;
            score=week(time);
            output;
      end;
      drop week1-week8;
run;

Also we need to transfer Categorical Variable to Dummy Variable:

data table11_3a;
      set table11_1long;
      if group='B' then do; a2=1; a3=0; end;
      if group='C' then do; a2=0; a3=1; end;
      if group='A' then do; a2=0; a3=0; end;
run;

Without considering the dependence of the score for different time, we first assume they are independent. Consider the interaction of time and Group, the model can be expressed as:


proc glm data=table11_3a;
      model score = time a2 a3 time*a2 time*a3 / solution ss3;
run;
quit;

The above model considers the interaction of Group and Time. It's a saturated model. It is the same as we make a regression of Score over Time for each Group. So, it can be equally achieved by:


proc reg data=table11_1long;
      by group;
      model score = time;
run; 

Sometimes it's required an exploratory analysis, or called data reduction or data summary.  It consists of summarizing the response profiles for each subject by a small number of descriptive statistics based on the assumption that measurements on the same subject are independent. Here intercept and slope are used as the summary statistics:

proc reg data = table11_1long;
      by subject;
      model score = time;
      ods output parameterestimates = table11_4 (keep = subject variable estimate stderr);
run;
quit;

proc tabulate data=table11_4;
      var estimate stderr;
      class variable subject / order=data;
      table subject='Subject'*sum='', (variable='')*(estimate stderr)*f=6.3 / rts=20 row=float;
run;

Then we can get the following summary table:

------------------------------------------------
|                  |  Intercept  |    time     |
|                  |-------------+-------------|
|                  |Param-|      |Param-|      |
|                  | eter |Stand-| eter |Stand-|
|                  |Estim-| ard  |Estim-| ard  |
|                  | ate  |Error | ate  |Error |
|------------------+------+------+------+------|
|Subject           |      |      |      |      |
|------------------|      |      |      |      |
|1                 |30.000| 7.289| 7.500| 1.443|
|------------------+------+------+------+------|
|2                 |15.536| 4.099| 3.214| 0.812|
|------------------+------+------+------+------|
|3                 |39.821| 3.209| 6.429| 0.636|
|------------------+------+------+------+------|
|4                 |11.607| 3.387| 8.393| 0.671|
|------------------+------+------+------+------|
|5                 |100.00| 0.000| 0.000| 0.000|
|------------------+------+------+------+------|
|6                 | 0.893| 5.304|11.190| 1.050|
|------------------+------+------+------+------|
|7                 |15.357| 4.669| 7.976| 0.925|
|------------------+------+------+------+------|
|8                 |25.357| 1.971| 5.893| 0.390|
|------------------+------+------+------+------|
|9                 |38.571| 3.522| 7.262| 0.698|
|------------------+------+------+------+------|
|10                |61.964| 2.236| 2.619| 0.443|
|------------------+------+------+------+------|
|11                |14.464| 5.893| 9.702| 1.167|
|------------------+------+------+------+------|
|12                |26.071| 2.147| 2.679| 0.425|
|------------------+------+------+------+------|
|13                |48.750| 8.927| 5.000| 1.768|
|------------------+------+------+------+------|
|14                |10.179| 3.209| 1.071| 0.636|
|------------------+------+------+------+------|
|15                |31.250| 1.948| 2.500| 0.386|
|------------------+------+------+------+------|
|16                |34.107| 2.809| 3.810| 0.556|
|------------------+------+------+------+------|
|17                |21.071| 2.551| 1.429| 0.505|
|------------------+------+------+------+------|
|18                |34.107| 1.164| 0.893| 0.231|
|------------------+------+------+------+------|
|19                |32.143| 1.164| 1.607| 0.231|
|------------------+------+------+------+------|
|20                |42.321| 3.698| 7.262| 0.732|
|------------------+------+------+------+------|
|21                |48.571| 6.140| 7.262| 1.216|
|------------------+------+------+------+------|
|22                |24.821| 1.885| 2.262| 0.373|
|------------------+------+------+------+------|
|23                |22.321| 1.709| 1.845| 0.338|
|------------------+------+------+------+------|
|24                |13.036| 4.492| 6.548| 0.890|
------------------------------------------------

The last part is to do analysis of variance for intercepts and slopes. Since in Table11_4 there is only these four variables: 'subject' 'variable' 'estimate' 'stderr'. In table11_1long there is variable 'time' and 'group'. We need to combine these two data sets:



proc sql;
      create table table11_5 as
      select table11_4.*, table11_1long.group
      from  table11_4, table11_1long
      where table11_4.subject = table11_1long.subject;
quit;

Then the analysis of Intercepts can be done by:

proc genmod data=table11_5;
      class group (ref='A' param=ref);
      where variable='Intercept';
      model estimate = group ;
run;
quit;

The analysis of Slope can be done by: 

proc genmod data=table11_5;
      class group (ref='A' param=ref);
      where variable='time';
      model estimate = group   ;
run;
quit;

Here we use proc genmod which allows us use categorical variables directly and has the choice of selecting reference level. We don't use proc glm since it has no choice of reference level in the regression.

In the book the author use proc reg to do it. If use proc reg, we need to use dummy variables rather than categorical variable. So, first it requires to combine the table table11_4 with table11_3a together:

proc sql;
      create table table11_5 as
      select distinct table11_4.*, x.a2 as a2, x.a3 as a3
      from table11_4, table11_3a as x
      where table11_4.subject = x.subject;
quit;

proc reg data=table11_5;
      where variable='Intercept';
      model estimate = a2 a3;
run;
quit;

proc reg data=table11_5;
      where variable='time';
      model estimate = a2 a3;
run;
quit;

It gives the same result as we use proc genmod.


As we know, usually the independence assumption here for the repeated measures is not suitable. A more suitable model for this data should consider the dependence of scores for these 8 time. Generalized Estimation Equations (GEE) model should be more reasonable.