Thursday, February 10, 2011

data set match merge, an example to study zz

The original post want to match merge A and B, and then got the result as below:

X   Y         Z               flag
1   red      brown       0
1   red      red            1
1   blue     brown       0
1   blue     red            0
3   yellow  red            0
3   yellow  pink           0
3   green   red            0
3   green   pink           0

In proc sql, it's not difficult to use inner join to merge A and B conditional on a.x = b.x. And we need to assign value to flag as if a.y matched by b.z.  At last the program can be down as:



data a;
 input x y $;
 cards;
 1 red
 1 blue
 3 yellow
 3 green
 ;
run;

data b;
 input x z $;
 cards;
 1 brown
 1 red
 2 yellow
 3 red
 3 pink
 ;
run;

proc sql;
   select a.x, a.y, b.z, case when a.y = b.z then 1 else 0 end as flag 
   from a, b
   where a.x=b.x;
quit;





***************************************************************************
 Somebody gives the solution with data step. It's pretty good, as following:



 data c;
 set a;
 do i = 1 to num;
  set b(rename = (x = x2)) nobs = num point = i;
  output;
 end;
run;

data want (drop = x2);
 set c;
 if y = z then flag = 1;
 else flag = 0;
 where x = x2;
run;


To make it clearer, here, the print of data set c is:

                                 Obs    x    y         x2    z

                                    1    1    red        1    brown
                                    2    1    red        1    red
                                    3    1    red        2    yellow
                                    4    1    red        3    red
                                    5    1    red        3    pink
                                    6    1    blue       1    brown
                                    7    1    blue       1    red
                                    8    1    blue       2    yellow
                                    9    1    blue       3    red
                                   10    1    blue       3    pink
                                   11    3    yellow     1    brown
                                   12    3    yellow     1    red
                                   13    3    yellow     2    yellow
                                   14    3    yellow     3    red
                                   15    3    yellow     3    pink
                                   16    3    green      1    brown
                                   17    3    green      1    red
                                   18    3    green      2    yellow
                                   19    3    green      3    red
                                   20    3    green      3    pink

1 comment:

  1. http://crackman.net/?p=1194

    两个数据集按照某一个变量匹配进行合并时,存在多对多或者一对多的匹配结果,看看下面两个程序:
    1.采取的是用DO循环语句,逐个遍历第二个数据集中每一个观测对象,根据指定的条件,输出符合条件的合并后的观测
    data AllSubjects(keep=Age Subject);
    input Age @;/*这里的INPUT语句指定读入DATALINES数据行中的第一个数据值13,但是由于有@符号,所以读入数据之后指针任然停留在第一行直到这一行数据读完或者
    其他命令改变指针*/


    do i=1 to 3;/*这里的DO LOOP到END之间可以看做一个程序块,主要执行三次INPUT 操作,每一次操作读取一个数据值,读完之后又从INPUT AGE开始,进入第二行*/
    input Subject $ @;
    output;
    end;
    datalines;
    13 Math History English
    14 Math Science English
    run;
    data schedules_dstep(drop=Age2);
    set sashelp.class(where=(Age in (13,14)));
    do i=1 to NSubjects;/*这里的DO LOOP-END也是一个程序块,这个程序块的目的是每执行一次就是让ALLSubjects中的每一个观测与CLASS数据集中当前的观测按照IF
    中指定的条件进行匹配,符合条件的输出*/
    set AllSubjects(rename=(Age=Age2)) nobs=NSubjects point=i;/*http://crackman.net/?tag=nobs 关于nobs等SET语句后面参数的说明*/
    if Age=Age2 then output;/**/
    end;
    run;
    proc sort data=schedules_dstep;
    by Subject Age Name;
    run;
    proc print data=schedules_dstep;
    by Subject;
    id Subject;/*http://crackman.net/?tag=proc-print有对ID的解释*/
    title1 “Many-to-many results”;
    run;

    2.看看SQL语句对这种匹配问题的处理
    data AllSubjects(keep=Age Subject);
    input Age @;
    do i=1 to 3;
    input Subject $ @;
    output;
    end;
    datalines;
    13 Math History English
    14 Math Science English
    run;
    proc sql;
    create table schedules_sql as
    select Class.*, Subject
    from sashelp.class(where=(Age in (13,14))),
    AllSubjects
    where Class.Age=AllSubjects.Age/*主要是在WHERE语句指定了匹配的条件,在SQL里面,其实是有一个中间体,就是Cartesian product,然后按照
    where语句对这各Cartesian product进行筛选得出的结果,至于SET语句与SQL语句的这种效率的差异,可以参见本网站中的其他文章*/
    order by Subject, Age, Name;
    quit;
    proc print data=schedules_sql;
    by Subject;
    id Subject;
    title1 “Many-to-many results”;
    run;

    两个方法进行一个比较:
    1.SQL方法显得代码很简洁,看上去容易理解,没哟SET里面的RENAME等语句。
    2.但是SQL由于Cartesian product得生成耗费额外较多的资源,显得效率较低,但是对于数据量比较少的时候,采用SQL是一个不错的选择。

    ReplyDelete