X Y Z flag
1 red brown 0
1 red red 1
1 blue brown 0
1 blue red 0
3 yellow red 0
3 yellow pink 0
3 green red 0
3 green pink 0
In proc sql, it's not difficult to use inner join to merge A and B conditional on a.x = b.x. And we need to assign value to flag as if a.y matched by b.z. At last the program can be down as:
data a;
input x y $;
cards;
1 red
1 blue
3 yellow
3 green
;
run;
data b;
input x z $;
cards;
1 brown
1 red
2 yellow
3 red
3 pink
;
run;
proc sql;
select a.x, a.y, b.z, case when a.y = b.z then 1 else 0 end as flag
from a, b
where a.x=b.x;
quit;
***************************************************************************
Somebody gives the solution with data step. It's pretty good, as following:
data c;
set a;
do i = 1 to num;
set b(rename = (x = x2)) nobs = num point = i;
output;
end;
run;
data want (drop = x2);
set c;
if y = z then flag = 1;
else flag = 0;
where x = x2;
run;
To make it clearer, here, the print of data set c is:
Obs x y x2 z
1 1 red 1 brown
2 1 red 1 red
3 1 red 2 yellow
4 1 red 3 red
5 1 red 3 pink
6 1 blue 1 brown
7 1 blue 1 red
8 1 blue 2 yellow
9 1 blue 3 red
10 1 blue 3 pink
11 3 yellow 1 brown
12 3 yellow 1 red
13 3 yellow 2 yellow
14 3 yellow 3 red
15 3 yellow 3 pink
16 3 green 1 brown
17 3 green 1 red
18 3 green 2 yellow
19 3 green 3 red
20 3 green 3 pink
http://crackman.net/?p=1194
ReplyDelete两个数据集按照某一个变量匹配进行合并时,存在多对多或者一对多的匹配结果,看看下面两个程序:
1.采取的是用DO循环语句,逐个遍历第二个数据集中每一个观测对象,根据指定的条件,输出符合条件的合并后的观测
data AllSubjects(keep=Age Subject);
input Age @;/*这里的INPUT语句指定读入DATALINES数据行中的第一个数据值13,但是由于有@符号,所以读入数据之后指针任然停留在第一行直到这一行数据读完或者
其他命令改变指针*/
do i=1 to 3;/*这里的DO LOOP到END之间可以看做一个程序块,主要执行三次INPUT 操作,每一次操作读取一个数据值,读完之后又从INPUT AGE开始,进入第二行*/
input Subject $ @;
output;
end;
datalines;
13 Math History English
14 Math Science English
run;
data schedules_dstep(drop=Age2);
set sashelp.class(where=(Age in (13,14)));
do i=1 to NSubjects;/*这里的DO LOOP-END也是一个程序块,这个程序块的目的是每执行一次就是让ALLSubjects中的每一个观测与CLASS数据集中当前的观测按照IF
中指定的条件进行匹配,符合条件的输出*/
set AllSubjects(rename=(Age=Age2)) nobs=NSubjects point=i;/*http://crackman.net/?tag=nobs 关于nobs等SET语句后面参数的说明*/
if Age=Age2 then output;/**/
end;
run;
proc sort data=schedules_dstep;
by Subject Age Name;
run;
proc print data=schedules_dstep;
by Subject;
id Subject;/*http://crackman.net/?tag=proc-print有对ID的解释*/
title1 “Many-to-many results”;
run;
2.看看SQL语句对这种匹配问题的处理
data AllSubjects(keep=Age Subject);
input Age @;
do i=1 to 3;
input Subject $ @;
output;
end;
datalines;
13 Math History English
14 Math Science English
run;
proc sql;
create table schedules_sql as
select Class.*, Subject
from sashelp.class(where=(Age in (13,14))),
AllSubjects
where Class.Age=AllSubjects.Age/*主要是在WHERE语句指定了匹配的条件,在SQL里面,其实是有一个中间体,就是Cartesian product,然后按照
where语句对这各Cartesian product进行筛选得出的结果,至于SET语句与SQL语句的这种效率的差异,可以参见本网站中的其他文章*/
order by Subject, Age, Name;
quit;
proc print data=schedules_sql;
by Subject;
id Subject;
title1 “Many-to-many results”;
run;
两个方法进行一个比较:
1.SQL方法显得代码很简洁,看上去容易理解,没哟SET里面的RENAME等语句。
2.但是SQL由于Cartesian product得生成耗费额外较多的资源,显得效率较低,但是对于数据量比较少的时候,采用SQL是一个不错的选择。